Convert HTML to XHTML and Clean Unnecessary Tags and Attributes

Omar Al Zabir

24 Jun 2005

CPOL


Convert HTML to XHTML while applying tag and attribute filters, producing clean, tidy HTML for web publishing.

Download source files - 48.3 KB

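Introduction

This is a class library that helps you produce valid XHTML from HTML. It also supports tag and attribute filtering: you specify exactly which tags and attributes are allowed in the output, and everything else is filtered out. You can use it to clean up the bloated HTML that Microsoft Word generates when a document is saved as HTML. You can also use it to clean HTML before posting to a blog site, so that blog engines such as WordPress and B2evolution do not reject it.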

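How it works

There are two classes: HtmlReader and HtmlWriter.

HtmlReader extends Chris Lovett's well-known SgmlReader. As it reads the HTML, it skips every node whose name carries a prefix. All those annoying tags such as <o:p>, <o:Document>, <st1:personname>, and hundreds of others are filtered out, so the HTML you read contains only core HTML tags.

HtmlWriter extends the regular XmlTextWriter so that it produces XML. XHTML is essentially HTML in XML form. Common tags that have no closing tag, such as <img>, <br>, and <hr>, must use the empty-element form in XHTML: <img .. />, <br/>, and <hr/>. Because XHTML is a well-formed XML document, you can read it with any XML parser, which lets you run XPath queries against it.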
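As an illustration of the XPath point, here is a minimal sketch that loads a well-formed XHTML fragment into an XmlDocument and collects every image source. The class and method names (XhtmlQuery, ImageSources) are hypothetical and not part of the library. The sketch assumes the fragment declares no default namespace and contains no named entities such as &nbsp;, which a plain XML parser rejects unless a DTD declares them.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of the article's library.
public static class XhtmlQuery
{
    // Returns the src attribute of every <img> element in a
    // well-formed XHTML fragment (assumes no default namespace).
    public static List<string> ImageSources(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        var sources = new List<string>();
        foreach (XmlNode attr in doc.SelectNodes("//img/@src"))
            sources.Add(attr.Value);
        return sources;
    }
}
```

If the document does carry the XHTML namespace on its root element, the query would need an XmlNamespaceManager and a prefixed path instead.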

HtmlReader

HtmlReader is very simple. Here is the whole class:

/// <summary>
/// This class skips all nodes that have some
/// kind of prefix. This trick does the job
/// of cleaning up MS Word/Outlook HTML markup.
/// </summary>
public class HtmlReader : Sgml.SgmlReader
{
    public HtmlReader( TextReader reader ) : base( )
    {
        base.InputStream = reader;
        base.DocType = "HTML";
    }
    public HtmlReader( string content ) : base( )
    {
        base.InputStream = new StringReader( content );
        base.DocType = "HTML";
    }
    public override bool Read()
    {
        bool status = base.Read();
        if( status )
        {
            if( base.NodeType == XmlNodeType.Element )
            {
                // Got a node with a prefix. This must be one
                // of those "<o:p>" tags or something similar.
                // Skip this node entirely. We want prefix-free
                // nodes so that the resulting XML
                // requires no namespace.
                if( base.Name.IndexOf(':') > 0 )
                    base.Skip();
            }
        }
        return status;
    }
}
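To try the prefix-skipping idea without a reference to SgmlReader, the following sketch applies the same override to XmlTextReader with Namespaces turned off, so that a name like o:p is accepted without a namespace declaration. PrefixSkippingReader and ElementNames are hypothetical names, and unlike SgmlReader this stand-in only accepts well-formed input. It also loops after Skip(), a small variation on the listing above, so that several prefixed elements in a row are all dropped.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical stand-in for HtmlReader; needs well-formed input.
public class PrefixSkippingReader : XmlTextReader
{
    public PrefixSkippingReader(string content)
        : base(new StringReader(content))
    {
        // Treat "o:p" as a plain name so no xmlns declaration is required.
        Namespaces = false;
    }

    public override bool Read()
    {
        bool status = base.Read();
        // Keep skipping while we are positioned on a prefixed element.
        while (status && NodeType == XmlNodeType.Element && Name.IndexOf(':') > 0)
        {
            Skip();          // moves past the element and its whole subtree
            status = !EOF;   // false if nothing follows the skipped element
        }
        return status;
    }

    // Demo helper: names of the elements that survive the filter.
    public static List<string> ElementNames(string content)
    {
        var names = new List<string>();
        using (var reader = new PrefixSkippingReader(content))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    names.Add(reader.Name);
            }
        }
        return names;
    }
}
```

Calling PrefixSkippingReader.ElementNames on a fragment that mixes Word-style tags with ordinary ones should return only the ordinary element names.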

HtmlWriter

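This class is a bit trickier. Here are the techniques it uses:

Override XmlWriter's WriteString method so that it does not apply the regular XML encoding to the content. The encoding for the HTML document is done manually.
Override WriteStartElement to stop disallowed tags from being written to the output.
Override WriteAttributes to keep unwanted attributes out of the output.

Let's look at the class part by part.

Configurability

You can configure HtmlWriter by modifying the following block: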

public class HtmlWriter : XmlTextWriter
{
    /// <summary>
    /// If set to true, the output is filtered
    /// using tag and attribute filtering,
    /// space reduction, etc.
    /// </summary>
    public bool FilterOutput = false;
    /// <summary>
    /// If true, consecutive &nbsp; entities are replaced with a single instance
    /// </summary>
    public bool ReduceConsecutiveSpace = true;
    /// <summary>
    /// Set the tag names in lower case which are allowed to go to output
    /// </summary>
    public string [] AllowedTags = 
        new string[] { "p", "b", "i", "u", "em", "big", "small", 
        "div", "img", "span", "blockquote", "code", "pre", "br", "hr", 
        "ul", "ol", "li", "del", "ins", "strong", "a", "font", "dd", "dt"};
    /// <summary>
    /// If any tag found which is not allowed, it is replaced by this tag.
    /// Specify a tag which has least impact on output
    /// </summary>
    public string ReplacementTag = "dd";
    /// <summary>
    /// New lines \r\n are replaced with space 
    /// which saves space and makes the
    /// output compact
    /// </summary>
    public bool RemoveNewlines = true;
    /// <summary>
    /// Specify which attributes are allowed. 
    /// Any other attribute will be discarded
    /// </summary>
    public string [] AllowedAttributes = new string[] 
    { 
        "class", "href", "target", "border", "src", 
        "align", "width", "height", "color", "size" 
    };
}

The WriteString method

/// <summary>
/// The reason why we are overriding
/// this method is, we do not want the output to be
/// encoded for texts inside attribute
/// and inside node elements. For example, every &nbsp;
/// gets converted to &amp;nbsp; in the output. That is not
/// what we want for HTML, where &nbsp; must stay as it is.
/// </summary>
/// <param name="text"></param>
public override void WriteString(string text)
{
    // Change every non-breaking space character (U+00A0) to the &nbsp; entity
    text = text.Replace( "\u00A0", "&nbsp;" );
    /// When you are reading RSS feed and writing Html, 
    /// this line helps remove those CDATA tags
    text = text.Replace("<![CDATA[","");
    text = text.Replace("]]>", "");

    // Do some encoding of our own because
    // we are going to use WriteRaw which won't
    // do any of the necessary encoding
    text = text.Replace( "<", "&lt;" );
    text = text.Replace( ">", "&gt;" );
    text = text.Replace( "'", "&apos;" );
    text = text.Replace( "\"", "&quot;" );

    if( this.FilterOutput )
    {
        text = text.Trim();

        // We want to replace consecutive spaces
        // to one space in order to save horizontal width
        if( this.ReduceConsecutiveSpace ) 
            text = text.Replace("&nbsp;&nbsp;&nbsp;", "&nbsp;");
        if( this.RemoveNewlines ) 
            text = text.Replace(Environment.NewLine, " ");

        base.WriteRaw( text );
    }
    else
    {
        base.WriteRaw( text );
    }
}
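The manual encoding step can be tried in isolation with a small helper like the one below. HtmlTextEncoder is a hypothetical name, and the sketch is not the library's code. Encode mirrors the replacements in the listing above; like the listing, it leaves & untouched, so a literal ampersand passes through as-is. CollapseNbsp is a possible variation that uses a regular expression to collapse any run of two or more &nbsp; entities, whereas the listing replaces groups of exactly three.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper that isolates the text-encoding step.
public static class HtmlTextEncoder
{
    // Same replacements as the WriteString listing, in the same order.
    public static string Encode(string text)
    {
        text = text.Replace("\u00A0", "&nbsp;");
        text = text.Replace("<", "&lt;");
        text = text.Replace(">", "&gt;");
        text = text.Replace("'", "&apos;");
        text = text.Replace("\"", "&quot;");
        return text;
    }

    // Variation (an assumption, not the article's logic): collapse
    // every run of two or more &nbsp; entities into a single one.
    public static string CollapseNbsp(string encoded)
    {
        return Regex.Replace(encoded, "(?:&nbsp;){2,}", "&nbsp;");
    }
}
```

Because the result is meant for WriteRaw, anything this helper does not escape reaches the output verbatim.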

WriteStartElement: applying tag filtering

public override void WriteStartElement(string prefix, 
    string localName, string ns)
{
    if( this.FilterOutput ) 
    {
        bool canWrite = false;
        string tagLocalName = localName.ToLower();
        foreach( string name in this.AllowedTags )
        {
            if( name == tagLocalName )
            {
                canWrite = true;
                break;
            }
        }
        // Disallowed tags are replaced with the configured ReplacementTag
        if( !canWrite ) 
            localName = this.ReplacementTag;
    }
    base.WriteStartElement(prefix, localName, ns);
}
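The tag filter can be exercised on its own with a cut-down writer such as the sketch below. TagFilteringWriter and its Demo method are hypothetical; the constructor arguments are an assumption made so the example is self-contained, and a HashSet lookup stands in for the loop over AllowedTags.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical cut-down writer: only the tag filter, nothing else.
public class TagFilteringWriter : XmlTextWriter
{
    private readonly HashSet<string> allowed;
    private readonly string replacement;

    public TagFilteringWriter(TextWriter output,
        IEnumerable<string> allowedTags, string replacementTag)
        : base(output)
    {
        // Case-insensitive set, so "P" and "p" are treated alike.
        allowed = new HashSet<string>(allowedTags, StringComparer.OrdinalIgnoreCase);
        replacement = replacementTag;
    }

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        // Swap any tag that is not on the allow list for the replacement tag.
        if (!allowed.Contains(localName))
            localName = replacement;
        base.WriteStartElement(prefix, localName, ns);
    }

    // Writes <table><p>hi</p></table> through the filter and returns the result.
    public static string Demo()
    {
        var output = new StringWriter();
        using (var writer = new TagFilteringWriter(output, new[] { "p", "b" }, "dd"))
        {
            writer.WriteStartElement("table");   // not allowed
            writer.WriteStartElement("p");       // allowed
            writer.WriteString("hi");
            writer.WriteEndElement();
            writer.WriteEndElement();
            writer.Flush();
            return output.ToString();
        }
    }
}
```

Since only the start tag's name is rewritten and XmlTextWriter tracks the open element itself, the matching end tag comes out with the replacement name as well.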

The WriteAttributes method: applying attribute filtering

bool canWrite = false;
string attributeLocalName = reader.LocalName.ToLower();
foreach( string name in this.AllowedAttributes )
{
    if( name == attributeLocalName )
    {
        canWrite = true;
        break;
    }
}
// If allowed, write the attribute
if( canWrite ) 
    this.WriteStartAttribute(reader.Prefix, 
    attributeLocalName, reader.NamespaceURI);
while (reader.ReadAttributeValue())
{
    if (reader.NodeType == XmlNodeType.EntityReference)
    {
        if( canWrite ) this.WriteEntityRef(reader.Name);
        continue;
    }
    if( canWrite )this.WriteString(reader.Value);
}
if( canWrite ) this.WriteEndAttribute();
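The fragment above is the body of a loop over the reader's attributes. To show how such an override might be wired up end to end, here is a simplified sketch. AttributeFilteringWriter and Demo are hypothetical names, the constructor signature is an assumption, and the sketch copies each attribute value in one piece with reader.Value rather than walking it with ReadAttributeValue as the original does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical cut-down writer: only the attribute filter.
public class AttributeFilteringWriter : XmlTextWriter
{
    private readonly HashSet<string> allowed;

    public AttributeFilteringWriter(TextWriter output,
        IEnumerable<string> allowedAttributes)
        : base(output)
    {
        allowed = new HashSet<string>(allowedAttributes, StringComparer.OrdinalIgnoreCase);
    }

    public override void WriteAttributes(XmlReader reader, bool defattr)
    {
        // This sketch only handles a reader positioned on an element.
        if (reader.NodeType != XmlNodeType.Element || !reader.MoveToFirstAttribute())
            return;
        do
        {
            // Copy the attribute only if its name is on the allow list.
            if (allowed.Contains(reader.LocalName))
                WriteAttributeString(reader.LocalName.ToLowerInvariant(), reader.Value);
        } while (reader.MoveToNextAttribute());
        reader.MoveToElement();   // leave the reader back on the element
    }

    // Copies a single empty element through the filter.
    public static string Demo(string element)
    {
        var output = new StringWriter();
        using (var reader = XmlReader.Create(new StringReader(element)))
        using (var writer = new AttributeFilteringWriter(output, new[] { "href", "class" }))
        {
            reader.MoveToContent();
            writer.WriteStartElement(reader.LocalName);
            writer.WriteAttributes(reader, false);
            writer.WriteEndElement();
            writer.Flush();
            return output.ToString();
        }
    }
}
```

Passing an anchor element that carries an onclick attribute alongside href and class should yield the same element with only href and class left.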

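Conclusion

The sample application is a utility that you can use right away to clean up HTML files. You can use these classes in applications such as blogging tools, where you need to post HTML to a web service.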