Convert HTML to XHTML and Clean Unnecessary Tags and Attributes






Convert HTML to XHTML while applying tag and attribute filters, producing clean, tidy HTML for publishing on the web.
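Introduction
This is a class library that helps you produce valid XHTML from HTML. It also supports tag and attribute filtering: you specify exactly which tags and attributes are allowed in the output, and everything else is filtered out. You can use it to clean up the bulky HTML that Microsoft Word generates when a document is saved as HTML. You can also use it to clean HTML before posting to a blog site, so that blog engines such as WordPress and B2evolution do not reject it.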
How It Works
There are two classes: HtmlReader and HtmlWriter.
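HtmlReader extends Chris Lovett's well-known SgmlReader. As it reads the HTML, it skips every node whose name carries a prefix. As a result, all those nasty tags such as <o:p>, <o:Document>, <st1:personname>, and hundreds of others are filtered out, and the HTML you read contains only core HTML tags.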
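HtmlWriter extends the regular XmlWriter so that it produces XML. XHTML is essentially HTML in XML format. All the familiar tags that have no closing tag, such as <img>, <br>, and <hr>, must be written as empty elements in XHTML: <img .. />, <br/>, and <hr/>. Because XHTML is a well-formed XML document, you can read it with any XML parser, which lets you run XPath queries against it.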
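To illustrate the XPath point, here is a minimal sketch that loads an XHTML fragment with the framework's XmlDocument and collects every image source. The class and method names (XhtmlQuery, ImageSources) are hypothetical, and the sketch assumes the input has no DOCTYPE, no namespace declaration, and no undeclared entities such as &nbsp;, any of which would need extra handling.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XhtmlQuery
{
    // Returns the value of every src attribute on <img> elements,
    // in document order. Throws XmlException if the input is not
    // well-formed XML.
    public static List<string> ImageSources(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//img/@src"))
        {
            result.Add(node.Value);
        }
        return result;
    }
}
```

Calling XhtmlQuery.ImageSources on a cleaned fragment gives a plain list of URLs. The same call on raw HTML with an unclosed <img> would throw, which is exactly why the conversion step matters.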
HtmlReader
HtmlReader is very simple. Here is the whole class:
using System.IO;
using System.Xml;

/// <summary>
/// This class skips all nodes that have some
/// kind of prefix. This trick does the job
/// of cleaning up MS Word/Outlook HTML markup.
/// </summary>
public class HtmlReader : Sgml.SgmlReader
{
    public HtmlReader( TextReader reader ) : base( )
    {
        base.InputStream = reader;
        base.DocType = "HTML";
    }

    public HtmlReader( string content ) : base( )
    {
        base.InputStream = new StringReader( content );
        base.DocType = "HTML";
    }

    public override bool Read()
    {
        bool status = base.Read();
        if( status )
        {
            if( base.NodeType == XmlNodeType.Element )
            {
                // Got a node with a prefix. This must be one
                // of those "<o:p>" tags or something similar.
                // Skip this node entirely. We want prefix-free
                // nodes so that the resulting XML
                // requires no namespace.
                if( base.Name.IndexOf(':') > 0 )
                    base.Skip();
            }
        }
        return status;
    }
}
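The prefix test can be tried without SgmlReader. The sketch below is not the author's class: it swaps in the framework's XmlTextReader with namespace support turned off, so names like o:p parse without a declaration, and it needs well-formed input. PrefixSkipDemo and ElementNames are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PrefixSkipDemo
{
    // Returns the names of all elements that survive prefix filtering.
    public static List<string> ElementNames(string xml)
    {
        var names = new List<string>();
        // Namespaces = false lets "o:p" parse as a plain name.
        var reader = new XmlTextReader(new StringReader(xml)) { Namespaces = false };

        bool more = reader.Read();
        while (more)
        {
            bool isElement = reader.NodeType == XmlNodeType.Element;
            if (isElement && reader.Name.IndexOf(':') > 0)
            {
                // Skip() drops the element and its children and leaves
                // the reader on the node that follows, so re-check that
                // node instead of calling Read() again.
                reader.Skip();
                more = !reader.EOF;
                continue;
            }
            if (isElement)
            {
                names.Add(reader.Name);
            }
            more = reader.Read();
        }
        return names;
    }
}
```

For the input <p>Hi<o:p>x</o:p><b>z</b></p>, only p and b are reported; the o:p element and its text never reach the caller.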
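HtmlWriter
This class is a bit trickier. It relies on the following techniques:
- Override the WriteString method of XmlWriter so that it does not apply the regular XML encoding to the content. The encoding needed for an HTML document is done by hand.
- Override WriteStartElement to keep disallowed tags out of the output.
- Override WriteAttributes to keep unwanted attributes out.
Let's look at the class part by part.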
Configurability
You can configure HtmlWriter by modifying the following block:
public class HtmlWriter : XmlTextWriter
{
    /// <summary>
    /// If set to true, the output is filtered
    /// using tag and attribute filtering,
    /// space reduction, etc.
    /// </summary>
    public bool FilterOutput = false;

    /// <summary>
    /// If true, consecutive spaces are reduced to a single space
    /// </summary>
    public bool ReduceConsecutiveSpace = true;

    /// <summary>
    /// Names, in lower case, of the tags that are allowed in the output
    /// </summary>
    public string [] AllowedTags =
        new string[] { "p", "b", "i", "u", "em", "big", "small",
        "div", "img", "span", "blockquote", "code", "pre", "br", "hr",
        "ul", "ol", "li", "del", "ins", "strong", "a", "font", "dd", "dt"};

    /// <summary>
    /// Any tag that is not allowed is replaced by this tag.
    /// Specify a tag that has the least impact on the output
    /// </summary>
    public string ReplacementTag = "dd";

    /// <summary>
    /// New lines (\r\n) are replaced with a space,
    /// which saves space and makes the
    /// output compact
    /// </summary>
    public bool RemoveNewlines = true;

    /// <summary>
    /// Specify which attributes are allowed.
    /// Any other attribute will be discarded
    /// </summary>
    public string [] AllowedAttributes = new string[]
    {
        "class", "href", "target", "border", "src",
        "align", "width", "height", "color", "size"
    };
}
WriteString Method
/// <summary>
/// The reason for overriding this method is that
/// we do not want the output to be
/// encoded for text inside attributes
/// and inside node elements. For example, every &nbsp;
/// gets converted to &amp;nbsp; in the output. But this does not
/// apply to HTML. In HTML, we need to keep &nbsp; as it is.
/// </summary>
/// <param name="text"></param>
public override void WriteString(string text)
{
    // Change all non-breaking spaces to normal spaces
    text = text.Replace( "\u00A0", " " );

    // When you are reading an RSS feed and writing HTML,
    // these lines help remove those CDATA tags
    text = text.Replace("<![CDATA[","");
    text = text.Replace("]]>", "");

    // Do some encoding of our own because
    // we are going to use WriteRaw, which won't
    // do any of the necessary encoding
    text = text.Replace( "<", "&lt;" );
    text = text.Replace( ">", "&gt;" );
    text = text.Replace( "'", "&apos;" );
    text = text.Replace( "\"", "&quot;" );

    if( this.FilterOutput )
    {
        text = text.Trim();

        // We want to replace consecutive spaces
        // with one space in order to save horizontal width
        if( this.ReduceConsecutiveSpace )
            text = text.Replace("  ", " ");
        if( this.RemoveNewlines )
            text = text.Replace(Environment.NewLine, " ");

        base.WriteRaw( text );
    }
    else
    {
        base.WriteRaw( text );
    }
}
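The text handling in WriteString can also be viewed as a pure function, which makes it easy to try on sample strings. The following is only a sketch under a few assumptions: HtmlText.Normalize is a hypothetical helper, not part of the library; it removes newlines before collapsing spaces, and it uses a regular expression so that a run of any length becomes one space in a single pass. Both choices differ slightly from the method above.

```csharp
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Applies the same manual encoding as WriteString, then the
    // optional whitespace clean-up, and returns the result.
    public static string Normalize(string text,
                                   bool reduceSpaces = true,
                                   bool removeNewlines = true)
    {
        // Non-breaking space (U+00A0) becomes a normal space.
        text = text.Replace('\u00A0', ' ');

        // Manual encoding. '&' is left untouched, as in the original
        // method, so existing entities pass through unchanged.
        text = text.Replace("<", "&lt;").Replace(">", "&gt;");
        text = text.Replace("'", "&apos;").Replace("\"", "&quot;");

        text = text.Trim();

        if (removeNewlines)
        {
            text = text.Replace("\r\n", " ").Replace("\n", " ");
        }
        if (reduceSpaces)
        {
            // Collapse any run of two or more spaces into one.
            text = Regex.Replace(text, " {2,}", " ");
        }
        return text;
    }
}
```

Removing newlines first means the spaces they leave behind are collapsed too, so the output never contains a double space where a line break used to be.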
WriteStartElement: Applying Tag Filtering
public override void WriteStartElement(string prefix,
    string localName, string ns)
{
    if( this.FilterOutput )
    {
        bool canWrite = false;
        string tagLocalName = localName.ToLower();

        foreach( string name in this.AllowedTags )
        {
            if( name == tagLocalName )
            {
                canWrite = true;
                break;
            }
        }

        // Substitute the configured replacement tag
        // instead of a hard-coded "dd"
        if( !canWrite )
            localName = this.ReplacementTag;
    }
    base.WriteStartElement(prefix, localName, ns);
}
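The allowed-tag lookup is a linear scan over an array. As an alternative sketch, the same decision can be made with a case-insensitive HashSet. TagFilter and Resolve are hypothetical names, the tag list is shortened for the example, and unlike the method above this version also lower-cases the name it returns.

```csharp
using System;
using System.Collections.Generic;

public static class TagFilter
{
    // Shortened allow-list; the comparer makes "P" and "p" match.
    private static readonly HashSet<string> Allowed = new HashSet<string>(
        new[] { "p", "b", "i", "a", "img", "br" },
        StringComparer.OrdinalIgnoreCase);

    // Returns the tag name to write: the lower-cased original if it
    // is allowed, otherwise the replacement tag.
    public static string Resolve(string localName, string replacementTag = "dd")
    {
        return Allowed.Contains(localName)
            ? localName.ToLowerInvariant()
            : replacementTag;
    }
}
```

With this list, Resolve("P") yields "p" and Resolve("table") yields "dd". For a couple of dozen tags the speed difference is negligible; the set mainly removes the loop and the separate ToLower call.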
WriteAttributes Method: Applying Attribute Filtering
// Fragment from the WriteAttributes override: the reader
// is positioned on one attribute when this code runs.
bool canWrite = false;
string attributeLocalName = reader.LocalName.ToLower();

foreach( string name in this.AllowedAttributes )
{
    if( name == attributeLocalName )
    {
        canWrite = true;
        break;
    }
}

// If allowed, write the attribute
if( canWrite )
    this.WriteStartAttribute(reader.Prefix,
        attributeLocalName, reader.NamespaceURI);

// Always consume the attribute value, even when it is
// discarded, so the reader moves past it
while (reader.ReadAttributeValue())
{
    if (reader.NodeType == XmlNodeType.EntityReference)
    {
        if( canWrite ) this.WriteEntityRef(reader.Name);
        continue;
    }
    if( canWrite ) this.WriteString(reader.Value);
}

if( canWrite ) this.WriteEndAttribute();
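The fragment above handles one attribute at a time; the loop that walks over all attributes of an element is not shown. As a rough sketch of that surrounding loop, the example below uses the standard XmlReader attribute navigation calls and simply reports which attributes would be kept. AttributeFilter and its method are hypothetical names, and the input must be a well-formed element.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class AttributeFilter
{
    // Returns "name=value" for every attribute of the root element
    // whose lower-cased name appears in the allowed collection.
    public static List<string> AllowedAttributes(string elementXml,
                                                 ICollection<string> allowed)
    {
        var kept = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(elementXml)))
        {
            reader.MoveToContent(); // position on the root element
            if (reader.MoveToFirstAttribute())
            {
                do
                {
                    string name = reader.LocalName.ToLowerInvariant();
                    if (allowed.Contains(name))
                    {
                        kept.Add(name + "=" + reader.Value);
                    }
                } while (reader.MoveToNextAttribute());

                // Return to the owning element when done.
                reader.MoveToElement();
            }
        }
        return kept;
    }
}
```

For an <a> element carrying href, onclick, and Class attributes with an allow-list of class and href, the result contains the href and class entries only; the onclick handler is dropped.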
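Conclusion
The sample application is a utility you can use right away to clean HTML files. You can use these classes in applications such as blogging tools, where HTML has to be posted to a web service.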