Data URI Image Extractor

Chad Z. Hower aka Kudzu


July 5, 2011

BSD


This article shows a simple way to extract HTML data URI images and convert the HTML to use external images.

Download source - 3.51 KB

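Introduction

HTML allows raw image data to be embedded directly in the document, so no separate image file is needed. In many cases this speeds up HTTP transfers, but it causes compatibility problems, particularly with Internet Explorer 8 and earlier. This article shows a simple way to extract such images and convert the HTML to use external images.

Data URIs

Normally, images are included using the following syntax: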

<img src="Image1.png">

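The data URI syntax, however, allows the image to be embedded directly. This reduces the number of HTTP requests and lets the document be saved as a single file. Although this article deals only with img tags, the same approach can be applied to other tags. Here is an example of a data URI in use: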

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==">
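
To make the structure of the example concrete, here is a minimal sketch that splits a base64 data URI into its MIME type and its decoded bytes. The class name DataUriDemo, the method Parse, and the constant Sample are hypothetical names introduced for illustration; they are not part of ImageExtract. The sketch assumes the URI always has the form data:<mime>;base64,<payload>.

```csharp
using System;

static class DataUriDemo {
  // Hypothetical helper: splits "data:<mime>;base64,<payload>" into
  // the MIME type and the decoded binary data.
  public static (string Mime, byte[] Data) Parse(string aUri) {
    const string xPrefix = "data:";
    const string xMarker = ";base64,";
    int xMarkerPos = aUri.IndexOf(xMarker, StringComparison.Ordinal);
    if (!aUri.StartsWith(xPrefix, StringComparison.Ordinal) || xMarkerPos < 0) {
      throw new FormatException("Not a base64 data URI.");
    }
    string xMime = aUri.Substring(xPrefix.Length, xMarkerPos - xPrefix.Length);
    string xB64 = aUri.Substring(xMarkerPos + xMarker.Length);
    // Convert.FromBase64String ignores white space, so a payload that was
    // wrapped across several lines in the HTML still decodes.
    return (xMime, Convert.FromBase64String(xB64));
  }

  // The example image shown above, joined into a single string.
  public const string Sample =
    "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA" +
    "AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO" +
    "9TXL0Y4OHwAAAABJRU5ErkJggg==";
}
```

Calling DataUriDemo.Parse(DataUriDemo.Sample) should yield the MIME type image/png and a byte array that begins with the PNG file signature.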

More information about data URIs is available here.

SeaMonkey 2.1

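Most editors do not use the data URI syntax. However, starting with SeaMonkey 2.1 Composer (the Mozilla HTML editor), images that are dragged and dropped into a document are imported using this syntax. I think this is a very bad change, especially because it is not obvious and the behavior differs from 2.0. In my case, I had built a large HTML file with over 50 images before I discovered that Composer was embedding them rather than linking to them.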

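The Utility

Surprisingly, there are many online utilities that convert images to data URI format, but I could not find one that does the reverse. Because I did not want to edit my document by hand, I wrote a quick utility that extracts the images to disk and changes the HTML to use external images. The resulting document can be loaded by any standard browser, including Internet Explorer 8.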

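About the Source Code

The source code is very specific to my needs and has many limitations. I am publishing it anyway so that you can extend it for your own needs.

The parsing is very basic, but it works well with SeaMonkey output.
It currently supports only the PNG format.
There is no exception handling.
The code has not been optimized at all.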
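
As one possible starting point for lifting the PNG-only limitation, the sketch below maps the MIME type found in a data URI to a file extension. The class ImageTypes, the method ExtensionFor, and the particular set of types listed are assumptions made for illustration; the published utility contains nothing like this.

```csharp
using System;
using System.Collections.Generic;

static class ImageTypes {
  // Hypothetical lookup table; add further entries as needed.
  static readonly Dictionary<string, string> mExtensions =
    new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase) {
      { "image/png", ".png" },
      { "image/jpeg", ".jpg" },
      { "image/gif", ".gif" },
    };

  // Returns the file extension for a MIME type, or null when the type
  // is not in the table. The lookup is case-insensitive.
  public static string ExtensionFor(string aMime) {
    string xExt;
    return mExtensions.TryGetValue(aMime, out xExt) ? xExt : null;
  }
}
```

A caller could skip any image whose type returns null and leave that data URI untouched in the output HTML.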

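Usage

ImageExtract is a console application that takes one argument: the input HTML file. Images are written to the same directory as the input file, and the new HTML file is given a -New suffix. So if the input is index.html, the output HTML will be index-New.html.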
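
The naming rule can be sketched in isolation as follows. The class OutputNaming and the method DestPathname are hypothetical names; the logic mirrors the path handling in the utility's Main method, with an added fallback to an empty directory in case Path.GetDirectoryName returns null.

```csharp
using System.IO;

static class OutputNaming {
  // Builds the output HTML path: same directory as the input,
  // same base name, with "-New.html" appended.
  public static string DestPathname(string aSrcPathname) {
    // GetDirectoryName returns "" for a bare file name and can return
    // null for a root path; treat both as "no directory".
    string xPath = Path.GetDirectoryName(aSrcPathname) ?? "";
    return Path.Combine(xPath,
      Path.GetFileNameWithoutExtension(aSrcPathname) + "-New.html");
  }
}
```

For example, OutputNaming.DestPathname("index.html") gives index-New.html. Note that the output always ends in .html, whatever the extension of the input file was.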

Source Code

I have provided the project as a download, but it is very simple: a C# .NET console application. For easy viewing, here is the class:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace ImageExtract {
  class Program {
    // NOTE - This program is rough and dirty - I designed
    // it to accomplish an urgent task. I have not built in
    // normal error handling etc.
    //
    // It also has not been optimized at all
    // and certainly is not very efficient.
    //
    // It also assumes all images are png files.
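    static void Main(string[] aArgs) {
      // Expect exactly one argument: the input HTML file
      if (aArgs.Length != 1) {
        Console.WriteLine("Usage: ImageExtract <input.html>");
        return;
      }
      string xSrcPathname = aArgs[0];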
      string xPath = Path.GetDirectoryName(xSrcPathname);
      string xDestPathname = Path.Combine(xPath, 
             Path.GetFileNameWithoutExtension(xSrcPathname) + "-New.html");
      int xImgIdx = 0;

      Console.WriteLine("Processing " + Path.GetFileName(xSrcPathname));
      string xSrc = File.ReadAllText(xSrcPathname);
      var xDest = new StringBuilder();

      string xStart = @"data:image/png;base64,";
      string xB64;
      int x = 0;
      int y = 0;
      int z = 0;
      do {
        x = xSrc.IndexOf(xStart, z);
        if (x == -1) {
          break;
        }
        // Write out preceding HTML
        xDest.Append(xSrc.Substring(z, x - z));

        // Get the Base64 string
        y = xSrc.IndexOf('"', x + 1);
        xB64 = xSrc.Substring(x + xStart.Length, y - x - xStart.Length);
        // Convert the Base64 string to binary data
        byte[] xImgData = System.Convert.FromBase64String(xB64);

        string xImgName;
        // Get Image name and replace it in the HTML
        // We don't want to overwrite images that might already exist on disk,
        // so cycle till we find a non used name
        do {
          xImgIdx++;
          xImgName = "Image" + xImgIdx.ToString("0000") + ".png";
        } while (File.Exists(Path.Combine(xPath, xImgName)));

        Console.WriteLine("Extracting " + xImgName);

        // Write image name into HTML
        xDest.Append(xImgName);
        // Write the binary data to disk
        File.WriteAllBytes(Path.Combine(xPath, xImgName), xImgData);

        z = y;
      } while (true);
      // Write out remaining HTML
      xDest.Append(xSrc.Substring(z));

      // Write out result
      File.WriteAllText(xDestPathname, xDest.ToString());
      Console.WriteLine("Output to " + Path.GetFileName(xDestPathname));
    }
  }
}
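
For readers who want to extend the parsing, here is a hedged alternative sketch that performs the same rewrite with a regular expression instead of IndexOf. It is not the author's method. The class DataUriRewriter and the method Rewrite are hypothetical, and the sketch assumes the data URI appears inside a quoted src attribute. Unlike the utility above, it keeps the decoded images in memory and does not check whether a file name already exists on disk.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class DataUriRewriter {
  // Matches src="data:image/png;base64,<payload>" with single or double
  // quotes. Group 1 is the quote character; "b64" is the payload.
  static readonly Regex mPattern = new Regex(
    "src=([\"'])data:image/png;base64,(?<b64>[A-Za-z0-9+/=\\s]+)\\1",
    RegexOptions.IgnoreCase);

  // Returns the rewritten HTML. Each decoded image is stored in aImages
  // under the file name that replaced its data URI.
  public static string Rewrite(string aHtml, IDictionary<string, byte[]> aImages) {
    int xIdx = 0;
    return mPattern.Replace(aHtml, aMatch => {
      xIdx++;
      string xName = "Image" + xIdx.ToString("0000") + ".png";
      aImages[xName] = Convert.FromBase64String(aMatch.Groups["b64"].Value);
      string xQuote = aMatch.Groups[1].Value;
      return "src=" + xQuote + xName + xQuote;
    });
  }
}
```

The caller can then write each entry of the dictionary to disk with File.WriteAllBytes, applying the same "find an unused name" loop as the original if overwriting is a concern.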

History

July 5, 2011: Initial version