Converting HTML Files to XHTML Files

March 12, 2007

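Introduction

Most of the articles on our website are written in HTML 4.0, but many of them do not conform to W3C standards because they contain a lot of malformed tags. I want to convert these files to XHTML so that they conform to the W3C standards.

Sometimes I also want to extract information from a web page. If the page is XHTML, I can get at the information more easily, because I can parse the file with the XML document classes.

Background

Several tools can convert HTML to XHTML. Dreamweaver can convert a file through the File --> Convert menu, but it has drawbacks: it is not free, and it sometimes fails to repair errors. Alternatively, you can use HTML Tidy, a well-known free tool. However, HTML Tidy can handle only certain languages. This article is based on HTML Tidy.

Because XHTML 2.0 is incompatible with HTML and XHTML 1.0, it is not widely used. For example, the default schema for .NET web applications is XHTML 1.0. In this article, XHTML means XHTML 1.0 Transitional.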

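Step 1. Convert the HTML File to UTF-8

To handle all languages, we first have to convert the file to UTF-8. (Note: if the source file is already UTF-8, you can skip this step.)

We can use the FileStream and BinaryReader classes to read the HTML file into a byte array and then decode the bytes into a String.

Here, we assume that the HTML file is encoded with the operating system's default encoding.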

/// <summary>
/// Reads the entire content of a file as a byte array.
/// </summary>
/// <param name="strFilePath">Path of the source file</param>
/// <returns>The file content as a byte array</returns>
public static byte[] ReadFileAsBytes(String strFilePath)
{
    // FileShare.ReadWrite lets us read a file that another process has open
    System.IO.FileStream fs = new System.IO.FileStream(strFilePath, 
        System.IO.FileMode.Open, System.IO.FileAccess.Read, 
        System.IO.FileShare.ReadWrite);
    System.IO.BinaryReader br = new System.IO.BinaryReader(fs);
    byte[] baResult = null;
    try
    {
        // ReadBytes keeps reading until the requested count or end of file;
        // a single Read call may return fewer bytes than requested
        baResult = br.ReadBytes((int)fs.Length);
    }
    finally
    {
        br.Close();
        fs.Close();
    }
    return baResult;
}

/// <summary>
/// Converts a byte array to a string using the system default encoding.
/// </summary>
/// <param name="bData">The bytes to decode</param>
/// <returns>The decoded string</returns>
public static String BytesToString(byte[] bData)
{
    // Code page 0 selects the operating system's default encoding
    return System.Text.Encoding.GetEncoding(0).GetString(bData);
}
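
The listing above always decodes with the system default encoding. As a slightly more defensive sketch, the helper below first checks for a UTF-8 byte order mark and only falls back to a caller-supplied encoding when none is found. The class and method names (EncodingSketch, HasUtf8Bom, Decode, ToUtf8) are hypothetical and not part of the original download, and a file without a BOM may of course still be UTF-8, so treat this as an illustration rather than a complete detection scheme.

```csharp
using System;
using System.Text;

public static class EncodingSketch
{
    // Returns true when the buffer starts with the UTF-8 byte order mark (EF BB BF).
    public static bool HasUtf8Bom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (data == null || data.Length < bom.Length) return false;
        for (int i = 0; i < bom.Length; i++)
        {
            if (data[i] != bom[i]) return false;
        }
        return true;
    }

    // Decodes as UTF-8 when a BOM is present; otherwise uses the fallback
    // encoding, which is the caller's assumption about the source file.
    public static string Decode(byte[] data, Encoding fallback)
    {
        if (HasUtf8Bom(data))
        {
            int skip = Encoding.UTF8.GetPreamble().Length;
            return Encoding.UTF8.GetString(data, skip, data.Length - skip);
        }
        return fallback.GetString(data);
    }

    // Re-encodes text as UTF-8 bytes without a BOM, ready to be written to disk.
    public static byte[] ToUtf8(string text)
    {
        return new UTF8Encoding(false).GetBytes(text);
    }
}
```

A caller could pass Encoding.GetEncoding(0) as the fallback to keep the behavior of BytesToString for files that carry no BOM.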

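Step 2. Convert the File to XHTML

We use HTML Tidy to convert the HTML file to an XHTML file. Tidy has many options; if you want the details, read its manual.

To convert a UTF-8 HTML file to an XHTML file, you can run it like this:

tidy.exe -raw -utf8 -asxhtml -i -f logfilename -o outputfilename inputfilename

Using the System.Diagnostics.Process class, you start a process, specify the names of the input and output files, and read the whole output file as the converted XHTML. If the output file does not exist, the input file probably contains severe errors, in which case you may need to check it manually.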

/// <summary>
/// Converts HTML content to XHTML by running HTML Tidy.
/// Requires "using System.Text;" for the Encoding class.
/// </summary>
/// <param name="strOriginalContent">The HTML content to convert</param>
/// <param name="strOutputPath">Working directory for tidy.exe and the 
/// temporary files; if null, the system temp path is used</param>
/// <returns>The converted XHTML content</returns>
public static String HTML2XHTML(String strOriginalContent,String strOutputPath)
{
    String strTempPath = strOutputPath != null ? strOutputPath : 
        System.IO.Path.GetTempPath();

    // Path.Combine also copes with a path that lacks a trailing separator
    String strFileName = System.IO.Path.Combine(strTempPath, "tidy.exe");
    // Check whether the Tidy executable exists; if not, extract it from
    // the embedded resources
    if (!System.IO.File.Exists(strFileName))
    {
        ChinaCars.Util.SysUtil.WriteFile(strFileName,
            ChinaCars.Util.App_GlobalResources.Resource.tidy);
    }

    // Create the process
    System.Diagnostics.ProcessStartInfo psiInfo = 
        new System.Diagnostics.ProcessStartInfo();
    psiInfo.FileName = strFileName;
    psiInfo.CreateNoWindow = true;
    psiInfo.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
    psiInfo.WorkingDirectory = strTempPath;

    String strMainFileName = System.Guid.NewGuid().ToString("N");
    // Specify the in/out/log file names, which are located in the 
    // temporary path
    String strInFileName = System.IO.Path.Combine(strTempPath, 
        strMainFileName + ".in");
    String strOutFileName = System.IO.Path.Combine(strTempPath, 
        strMainFileName + ".out");
    String strErrorFileName = System.IO.Path.Combine(strTempPath, 
        strMainFileName + ".log");
    System.IO.File.Delete(strInFileName);
    // Write the content as UTF-8 so that Tidy can process any language.
    // Encoding the string directly avoids a lossy round trip through the
    // system default code page.
    byte[] baUTF8Data = Encoding.UTF8.GetBytes(strOriginalContent);
    ChinaCars.Util.SysUtil.WriteFile(strInFileName, baUTF8Data);

    // UTF-8 version; the file names are relative to the working directory
    psiInfo.Arguments = String.Format(
        " -raw -utf8 -asxhtml -i -f {0}.log -o {0}.out {0}.in", 
        strMainFileName);
    System.IO.File.Delete(strOutFileName);
    System.Diagnostics.Process proc = 
        System.Diagnostics.Process.Start(psiInfo);
    proc.WaitForExit();
    System.IO.File.Delete(strInFileName);
    System.IO.File.Delete(strErrorFileName);

    // Tidy wrote UTF-8, so decode the output as UTF-8
    byte[] baResult = ChinaCars.Util.SysUtil.ReadFileAsBytes(strOutFileName);
    String strContent = Encoding.UTF8.GetString(baResult);
    // We need a DOCTYPE header for XHTML processing
    strContent = "<!DOCTYPE html PUBLIC " +
        "\"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
        "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">" + 
        strContent;
    System.IO.File.Delete(strOutFileName);
    return strContent;
}
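
The method above waits for Tidy to finish but does not look at how it finished. As a rough sketch, the helper below builds the same option list and reports Tidy's exit status, which its manual documents as 0 for success, 1 for warnings, and 2 for errors. The class name TidySketch, its method names, and the tidyPath parameter are hypothetical; they are not part of the original download.

```csharp
using System;
using System.Diagnostics;

public static class TidySketch
{
    // Builds the option list used in this article. Paths are quoted so that
    // file names containing spaces survive command-line parsing.
    public static string BuildArguments(string logFile, string outFile, string inFile)
    {
        return String.Format("-raw -utf8 -asxhtml -i -f \"{0}\" -o \"{1}\" \"{2}\"",
            logFile, outFile, inFile);
    }

    // Maps Tidy's documented exit status to a short description.
    public static string DescribeExitCode(int exitCode)
    {
        switch (exitCode)
        {
            case 0: return "ok";
            case 1: return "warnings";
            case 2: return "errors";
            default: return "unknown";
        }
    }

    // Runs Tidy and returns its exit code. tidyPath is a hypothetical
    // location of the executable supplied by the caller.
    public static int Run(string tidyPath, string workingDir, string arguments)
    {
        ProcessStartInfo psi = new ProcessStartInfo();
        psi.FileName = tidyPath;
        psi.Arguments = arguments;
        psi.WorkingDirectory = workingDir;
        psi.UseShellExecute = false; // CreateNoWindow is ignored when this is true
        psi.CreateNoWindow = true;
        using (Process proc = Process.Start(psi))
        {
            proc.WaitForExit();
            return proc.ExitCode;
        }
    }
}
```

When the exit code is 2, Tidy normally writes no output file, which matches the missing-output case described earlier; keeping the log file in that situation makes the manual check easier.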

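Step 3. Develop Your Own XHTML Resolver

Now you can use the System.Xml.XmlDocument class to load the XHTML document, but you may find that loading takes a very long time. Sometimes LoadXml even fails. Why?

The DOCTYPE header in the XHTML tells the .NET XML parser to load the corresponding DTD resources from the World Wide Web Consortium (W3C), and that may take several round trips or more. Fortunately, the .NET Framework lets us resolve external XML resources ourselves. By overriding the ResolveUri and GetEntity methods of XmlResolver, we can reduce the XHTML loading time. The code looks like this: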

using System;
using System.IO;
using System.Net;
using System.Xml;

public class XHTMLResolver:XmlResolver
{
    // No credentials are needed because the DTDs are served from memory
    override public ICredentials Credentials
    {
        set {  }
    }

    public XHTMLResolver()
    {

    }

    // Maps each XHTML public identifier to the system URI of its DTD
    public override Uri ResolveUri(Uri baseUri, String relativeUri)
    {
        if (String.Compare(relativeUri, 
            "-//W3C//DTD XHTML 1.0 Transitional//EN", true) == 0)
        {
            return new Uri(
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd");
        }
        else if (String.Compare(relativeUri, 
            "-//W3C//DTD XHTML 1.0 Strict//EN", true) == 0)
        {
            return new Uri(
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        }
        else if (String.Compare(relativeUri, 
            "-//W3C//DTD XHTML 1.0 Frameset//EN", true) == 0)
        {
            return new Uri(
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd");
        }
        else if (String.Compare(relativeUri, 
            "-//W3C//DTD XHTML 1.1//EN", true) == 0)
        {
            return new Uri("http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd");
        }

        return base.ResolveUri(baseUri,relativeUri);
    }
    // Serves the DTDs and entity files from embedded resources and falls
    // back to a network lookup only for URIs that are not listed here
    override public object GetEntity(Uri absoluteUri, string role, 
        Type ofObjectToReturn)
    {
        Object entityObj = null;
        String strURI = absoluteUri.AbsoluteUri;
        System.IO.MemoryStream msStream=null;

        // Compare in lower case so that the match is case-insensitive
        switch (strURI.ToLower())
        {
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd":
            msStream = new MemoryStream(Resource.xhtml1_transitional);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml1.dcl":
            msStream = new MemoryStream(Resource.xhtml1);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml-lat1.ent":
            msStream = new MemoryStream(Resource.xhtml_lat1);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml-special.ent":
            msStream = new MemoryStream(Resource.xhtml_special);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml-symbol.ent":
            msStream = new MemoryStream(Resource.xhtml_symbol);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml1-strict.dtd":
            msStream = new MemoryStream(Resource.xhtml1_strict);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml1-frameset.dtd":
            msStream = new MemoryStream(Resource.xhtml1_frameset);
            break;
        case "http://www.w3.org/tr/xhtml11/dtd/xhtml11.dtd":
            msStream = new MemoryStream(Resource.xhtml11);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlstyle-1.mod":
            msStream = new MemoryStream(Resource.xhtml_inlstyle_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-framework-1.mod":
            msStream = new MemoryStream(Resource.xhtml_framework_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-datatypes-1.mod":
            msStream = new MemoryStream(Resource.xhtml_datatypes_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-qname-1.mod":
            msStream = new MemoryStream(Resource.xhtml_qname_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-events-1.mod":
            msStream = new MemoryStream(Resource.xhtml_events_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-attribs-1.mod":
            msStream = new MemoryStream(Resource.xhtml_attribs_1);
            break;
        case "http://www.w3.org/tr/xhtml11/dtd/xhtml11-model-1.mod":
            msStream = new MemoryStream(Resource.xhtml11_model_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-charent-1.mod":
            msStream = new MemoryStream(Resource.xhtml_charent_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-lat1.ent":
            msStream = new MemoryStream(Resource.xhtml_lat11);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-symbol.ent":
            msStream = new MemoryStream(Resource.xhtml_symbol11);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-special.ent":
            msStream = new MemoryStream(Resource.xhtml_special11);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-text-1.mod":
            msStream = new MemoryStream(Resource.xhtml_text_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlstruct-1.mod":
            msStream = new MemoryStream(Resource.xhtml_inlstruct_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlphras-1.mod":
            msStream = new MemoryStream(Resource.xhtml_inlphras_1);
            break;
        case "http://www.w3.org/tr/ruby/xhtml-ruby-1.mod":
            msStream = new MemoryStream(Resource.xhtml_ruby_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-blkstruct-1.mod":
            msStream = new MemoryStream(Resource.xhtml_blkstruct_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-blkphras-1.mod":
            msStream = new MemoryStream(Resource.xhtml_blkphras_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-hypertext-1.mod":
            msStream = new MemoryStream(Resource.xhtml_hypertext_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-list-1.mod":
            msStream = new MemoryStream(Resource.xhtml_list_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-edit-1.mod":
            msStream = new MemoryStream(Resource.xhtml_edit_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-bdo-1.mod":
            msStream = new MemoryStream(Resource.xhtml_bdo_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-pres-1.mod":
            msStream = new MemoryStream(Resource.xhtml_pres_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlpres-1.mod":
            msStream = new MemoryStream(Resource.xhtml_inlpres_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-blkpres-1.mod":
            msStream = new MemoryStream(Resource.xhtml_blkpres_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-link-1.mod":
            msStream = new MemoryStream(Resource.xhtml_link_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-meta-1.mod":
            msStream = new MemoryStream(Resource.xhtml_meta_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-base-1.mod":
            msStream = new MemoryStream(Resource.xhtml_base_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-script-1.mod":
            msStream = new MemoryStream(Resource.xhtml_script_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-style-1.mod":
            msStream = new MemoryStream(Resource.xhtml_style_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-image-1.mod":
            msStream = new MemoryStream(Resource.xhtml_image_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-csismap-1.mod":
            msStream = new MemoryStream(Resource.xhtml_csismap_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-ssismap-1.mod":
            msStream = new MemoryStream(Resource.xhtml_ssismap_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-param-1.mod":
            msStream = new MemoryStream(Resource.xhtml_param_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-object-1.mod":
            msStream = new MemoryStream(Resource.xhtml_object_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-table-1.mod":
            msStream = new MemoryStream(Resource.xhtml_table_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-form-1.mod":
            msStream = new MemoryStream(Resource.xhtml_form_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-struct-1.mod":
            msStream = new MemoryStream(Resource.xhtml_struct_1);
            break;
        }

        if (msStream != null)
        {
            entityObj = msStream;
        }
        else
        {
            // Unknown resource: let the default resolver fetch it
            XmlUrlResolver xur = new XmlUrlResolver();
            entityObj = xur.GetEntity(absoluteUri, role, ofObjectToReturn);
        }
        return entityObj;
    }
}
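
For comparison, later versions of .NET include System.Xml.Resolvers.XmlPreloadedResolver, which can preload the XHTML 1.0 DTDs and serve them from memory in much the same spirit as the hand-written resolver. The sketch below is an alternative under that assumption, not the approach taken in this article; the class name PreloadedResolverSketch and the method LoadXhtml are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Resolvers;

public static class PreloadedResolverSketch
{
    // Loads an XHTML 1.0 string into an XmlDocument without contacting the
    // network for the DTD or its entity files.
    public static XmlDocument LoadXhtml(string xhtml)
    {
        // Preload the XHTML 1.0 DTDs that ship with the framework
        XmlPreloadedResolver resolver =
            new XmlPreloadedResolver(XmlKnownDtds.Xhtml10);

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse; // needed for entities such as &nbsp;
        settings.XmlResolver = resolver;

        XmlDocument doc = new XmlDocument();
        doc.XmlResolver = resolver;
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            doc.Load(reader);
        }
        return doc;
    }
}
```

Because no fallback resolver is supplied here, a document that references any other external resource fails to load instead of triggering a download. The hand-written XHTMLResolver remains useful when XHTML 1.1 modules or custom DTDs have to be served as well.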

Using the Code

By using the HTML2XHTML method, you can convert an HTML file to an XHTML file.

System.Net.WebClient webClient = new System.Net.WebClient();
String strHTMLContent = webClient.DownloadString("https://codeproject.org.cn");
// Pass null to use the system temp path as the working directory
String strXHTMLContent = ChinaCars.Util.XMLUtil.HTML2XHTML(strHTMLContent, null);

By using XHTMLResolver, you can load the XHTML file as XML very quickly.

System.Xml.XmlDocument xmlDoc=new System.Xml.XmlDocument();
xmlDoc.XmlResolver =new ChinaCars.Util.XHTMLResolver();
xmlDoc.LoadXml(strXHTMLContent);
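
Once the document is loaded, extracting information is a matter of XPath queries. One detail that often trips people up is that XHTML elements live in the http://www.w3.org/1999/xhtml namespace, so an unprefixed query such as //a matches nothing. The sketch below collects link targets, assuming the root element declares that namespace; the class name XhtmlQuerySketch, the method GetLinkTargets, and the prefix "x" are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XhtmlQuerySketch
{
    public const string XhtmlNamespace = "http://www.w3.org/1999/xhtml";

    // Returns the href value of every <a> element that has one.
    public static List<string> GetLinkTargets(XmlDocument doc)
    {
        // Bind an arbitrary prefix to the XHTML namespace for use in XPath
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", XhtmlNamespace);

        List<string> result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//x:a[@href]", ns))
        {
            result.Add(node.Attributes["href"].Value);
        }
        return result;
    }
}
```

Calling XhtmlQuerySketch.GetLinkTargets(xmlDoc) on the document loaded above would return the link targets in document order.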

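History

  • March 12, 2007: First version released

License

This article has no explicit license attached to it, but it may contain usage terms in the article text or the download files. If in doubt, please contact the author.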
