An Example of Converting Tables in a Series of PDFs into an XML Database

14 Dec 2012

CPOL

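A walkthrough of how to convert a series of PDFs containing tables output by Microsoft Access into an XML database.

Introduction

A sample application that shows how to convert a series of PDFs containing tables output by Microsoft Access 2007 into an XML database.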

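Background

The Gerald R. Ford Presidential Library and Museum is working with Wikimedia Commons (the media file repository for Wikimedia projects, including Wikipedia) through the GLAM project to donate material related to the life and presidency of Gerald R. Ford. Part of this media donation is a series of 200 dpi contact sheets that document Ford's legacy.

The contact sheets are available online at the Gerald R. Ford Presidential Digital Library and contain about 290,000 photographs taken by the White House Photographic Office in 1974-77. About 10,000 contact sheets are available online, arranged chronologically, and each roll of photographic film has an identifier. The contact sheet pages do not describe the rolls. Instead, the descriptions are provided separately as a series of chronologically arranged PDF files containing tables output from Microsoft Access 2007, which hold the metadata for the contact sheets. The original database that the tables were output from is not available.

For files uploaded to Wikimedia Commons to be findable and usable, they should carry all relevant metadata, such as description, author, and date. The PDF files contain the tables that hold this data, and each table row should be associated with the relevant contact sheet. The solution is to download all the PDF files and build a database from them, so that the metadata is machine-readable and can be used when the contact sheets are uploaded.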

PDF tables

For this sample file, the PDF tables look like this:

Page 1

Page 2

In Adobe Acrobat 9, the tables can be exported to a markup format, then parsed and stitched together into XML.

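Initial relationship

The simplest way to rebuild a relational database is to find a unique key. Each contact sheet has a unique identifier. For example, the URL of one contact sheet is: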

http://www.fordlibrarymuseum.gov/library/whphotos/A0032_NLGRF.jpg 

The format of a contact sheet URL is:

http://www.fordlibrarymuseum.gov/library/whphotos/<identifier>_NLGRF.jpg  
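
As a rough illustration of that URL pattern, here is a minimal Python sketch (not part of the original VBScript program). The function names are hypothetical; only the URL pattern itself comes from the example above.

```python
# Hypothetical helpers; only the URL pattern is taken from the article.
CONTACT_SHEET_URL = (
    "http://www.fordlibrarymuseum.gov/library/whphotos/{identifier}_NLGRF.jpg"
)
SUFFIX = "_NLGRF.jpg"


def contact_sheet_url(identifier):
    """Build the contact sheet URL for a roll identifier such as 'A0032'."""
    return CONTACT_SHEET_URL.format(identifier=identifier.strip())


def identifier_from_url(url):
    """Recover the roll identifier from a contact sheet URL."""
    name = url.rsplit("/", 1)[-1]
    if not name.endswith(SUFFIX):
        raise ValueError("not a contact sheet URL: " + url)
    return name[: -len(SUFFIX)]
```

Going in both directions is useful because the exported tables contain the identifier as link text and the full URL as the link target.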

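This identifier appears in the tables and can be used to fetch the contact sheet.

Table conversion

To use the tables in the PDFs, they first have to be read. Reading the PDF format directly is a challenge. To make the data easier to access, the PDFs can be exported to a different format. In this example, the PDFs are exported to HTML 3.2. The HTML 4 with CSS export introduces color markup that needlessly increases the file size. HTML 3.2 supports tables, which is all we need. The XML 1.0 export loses data, leaving many columns empty.

Instead, the PDFs are batch exported to HTML 3.2 (following the steps below), and the HTML output is then read. Fortunately, the HTML export preserves the table structure, which makes it easy to parse.

To batch export PDFs to HTML 3.2, you need at least Adobe Acrobat Pro 9. Here we use Acrobat Pro 9.1.

  1. Choose "File" -> "Export" -> "Export Multiple Files".
  2. Click "Add Files" -> "Add Files", then add the files to convert. You can also add files by dragging and dropping them. Press "OK".
  3. Specify the output folder and change "Export to" to "HTML 3.2". Then press "OK". The files will now be converted.

Because the PDFs were output by Microsoft Access, they are tagged PDFs in which the tables are explicitly marked. In the HTML output, this is evident from the structure.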

<TABLE border=0 cellSpacing=0 cellPadding=2 align=center><TBODY>
<TR>
<TH height=18 vAlign=center width=40 align=left>
   <FONT color=#00214d size=+1>Roll # </FONT></TH>
<TH height=18 vAlign=center width=48 align=left>Frames </TH>
<TH height=18 vAlign=center width=38 align=left>Tone </TH>
<TH height=18 vAlign=center width=235 align=left>Subject -Proper </TH>
<TH height=18 vAlign=center width=144 align=left>Subject -Generic </TH>
<TH height=18 vAlign=center width=161 align=left>Names </TH>
<TH height=18 vAlign=center width=100 align=left>Geographic </TH>
<TH height=18 vAlign=center width=82 align=left>Location </TH>
<TH height=18 vAlign=center width=80 align=left>Photographer </TH></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff></B>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>2 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>greeting each other </TD>
<TD height=35 vAlign=top width=161 align=left>Kissinger, Carl Albert </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>3-5 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>talking </TD>
<TD height=35 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>6-7 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>arm raised </TD>
<TD height=35 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>8 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>arm raised </TD>
<TD height=35 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>9-13 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>acknowledging applause </TD>
<TD height=35 vAlign=top width=161 align=left>GRF, Betty Ford </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>14-18 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=top width=144 align=left>Acceptance Speech </TD>
<TD height=35 vAlign=top width=161 align=left>GRF </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0001_NLGRF.jpg">
  <FONT color=#0000ff>A0001 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>19-21 </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=center width=144 align=left>acknowledging applause; walking off stage </TD>
<TD height=35 vAlign=top width=161 align=left>GRF, Betty Ford </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=34 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0002_NLGRF.jpg">
  <FONT color=#0000ff>A0002 </A></FONT></TD>
<TD height=34 vAlign=top width=48 align=left><FONT color=#000000>1A-4A </FONT></TD>
<TD height=34 vAlign=top width=38 align=left>Color </TD>
<TD height=34 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=34 vAlign=top width=144 align=left>Swearing In </TD>
<TD height=34 vAlign=center width=161 align=left>GRF, Chief Justice Warren Burger, Audience </TD>
<TD height=34 vAlign=top width=100 align=left></TD>
<TD height=34 vAlign=top width=82 align=left>East Room </TD>
<TD height=34 vAlign=top width=80 align=left>Moore </TD></TR>
<TR>
<TD height=35 vAlign=top width=40 align=left>
  <A href="http://www.fordlibrarymuseum.gov/library/whphotos/A0002_NLGRF.jpg">
  <FONT color=#0000ff>A0002 </A></FONT></TD>
<TD height=35 vAlign=top width=48 align=left><FONT color=#000000>5A-13A </FONT></TD>
<TD height=35 vAlign=top width=38 align=left>Color </TD>
<TD height=35 vAlign=center width=235 align=left>
  Swearing in of Gerald R. Ford as the 38th President of the United States </TD>
<TD height=35 vAlign=center width=144 align=left>Acceptance Speech, long shots </TD>
<TD height=35 vAlign=center width=161 align=left>
  GRF, Betty Ford, Chief Justice Warren Burger, Audience </TD>
<TD height=35 vAlign=top width=100 align=left></TD>
<TD height=35 vAlign=top width=82 align=left>East Room </TD>
<TD height=35 vAlign=top width=80 align=left>Moore</TD></TR></TBODY></TABLE>

These tables need to be stitched together to rebuild the database.
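
To make the table structure concrete, here is a minimal Python sketch that pulls the text of each data row out of markup like the export above. It is an illustration only: the article's program uses MSHTML from VBScript, whereas this sketch uses Python's standard html.parser, and the names TableRowParser and parse_rows are hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <TD> cell, grouped by <TR> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> arrives as "tr".
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse line breaks and runs of spaces inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <TH>, so they stay empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def parse_rows(html_text):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Nested tags such as FONT and A are ignored, so a cell's text is collected regardless of how the export nests or closes them.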

Getting the PDFs

A chronological list of links to the PDFs is available online.

A sample link is:

http://www.fordlibrarymuseum.gov/library/whphotos/19740809whpo.pdf

The format of the links is:

http://www.fordlibrarymuseum.gov/library/whphotos/<year><month><day>whpo.pdf
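
A minimal Python sketch of that link format follows. It assumes the month and day are zero-padded to two digits, as in the sample link 19740809whpo.pdf; the function name is hypothetical.

```python
from datetime import date


def description_pdf_url(day):
    """Build the description PDF URL for a given date.

    Assumes zero-padded month and day, as in the sample link.
    """
    return (
        "http://www.fordlibrarymuseum.gov/library/whphotos/"
        f"{day:%Y%m%d}whpo.pdf"
    )
```

For example, description_pdf_url(date(1974, 8, 9)) reproduces the sample link shown above. Not every date necessarily has a PDF, so a generated URL may not exist on the server.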

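To download all the PDFs on the page, we will use the free Firefox extension DownloadThemAll!

  1. After opening the web page, right-click it and choose "DownloadThemAll" from the context menu.
  2. Enter the location to save the files to. Under "Filters", check "Documents" and uncheck all the other options. If you scroll down, you will see the files to be downloaded highlighted in green. Click "Start!" to begin downloading.
  3. Wait for all the files to finish downloading.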
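
If a browser extension is not available, the link collection step could also be scripted. The following Python sketch is an alternative to the approach above, not what the article used: it extracts links ending in whpo.pdf from the HTML of the listing page. The page HTML and its base URL must be supplied by the caller, and the names PdfLinkParser and find_pdf_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of links whose target ends in 'whpo.pdf'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith("whpo.pdf"):
            # Resolve relative links against the listing page's URL.
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(page_html, base_url):
    """Return the description PDF URLs found in page_html, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(page_html)
    parser.close()
    return parser.links
```

Each returned URL could then be saved with urllib.request.urlretrieve(url, filename) from the standard library.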

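Putting it all together

Once all the PDFs are downloaded, everything can be put together. The PDFs are batch converted to HTML 3.2. A program then goes through each HTML file, extracts the tables, and rebuilds them in a single XML file.

The XML database that is created has the following outline: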

<root>
  <roll id="x">
    <frames name="" tone="" subjectproper="" 
      subjectgeneric="" names="" geographic="" 
      location="" photographer="" date=""/>
    <frames name="" tone="" subjectproper="" 
      subjectgeneric="" names="" geographic="" 
      location="" photographer="" date=""/>
  </roll>
</root>
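
As an illustration of this outline, here is a minimal Python sketch that builds the same structure with the standard xml.etree.ElementTree module. The article's program does this with Msxml2.DOMDocument.6.0 in VBScript; the helper name add_frames and the FIELDS tuple are hypothetical, with the attribute names taken from the outline above.

```python
import xml.etree.ElementTree as ET

# Attribute names from the outline, in table column order (after "Roll #").
FIELDS = ("name", "tone", "subjectproper", "subjectgeneric",
          "names", "geographic", "location", "photographer")


def add_frames(root, roll_id, values, date):
    """Append a <frames> element under the <roll> with the given id.

    The <roll> element is created on first use, so rows that share a
    roll identifier end up grouped under one element.
    """
    roll = next((r for r in root.findall("roll") if r.get("id") == roll_id), None)
    if roll is None:
        roll = ET.SubElement(root, "roll", id=roll_id)
    frames = ET.SubElement(roll, "frames", dict(zip(FIELDS, values)))
    frames.set("date", date)
    return frames
```

Starting from root = ET.Element("root"), calling add_frames once per table row yields the structure shown above, and ET.tostring(root) serializes it.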

The DTD would be:

<!DOCTYPE root [
<!ELEMENT root  ( roll+ )>
 
<!ELEMENT roll  ( frames+ )>
<!ATTLIST roll
id ID #REQUIRED
>
 
<!ELEMENT frames  EMPTY>
<!ATTLIST frames
name CDATA #REQUIRED
tone CDATA #REQUIRED
subjectproper CDATA #REQUIRED
subjectgeneric CDATA #REQUIRED
names CDATA #REQUIRED
geographic CDATA #REQUIRED
location CDATA #REQUIRED
photographer CDATA #REQUIRED
date CDATA #REQUIRED
>
]>

The data is extracted from the tables in the PDFs and from the PDF file names, which supply the date. The PDF file names have the format:

<year><month><day>whpo.pdf
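
The date can be recovered from the file name along these lines. This is a minimal Python sketch mirroring the Mid() slicing in the VBScript program; the function name is hypothetical, and unlike the VBScript it rejects names whose first eight characters are not a valid date.

```python
from datetime import datetime
from pathlib import PureWindowsPath


def date_from_file_name(file_name):
    """Return the date encoded in a name like '19740809whpo.pdf' as 'YYYY-MM-DD'.

    Raises ValueError if the first eight characters are not a valid date.
    """
    # PureWindowsPath accepts both "\\" and "/" as separators on any platform.
    stem = PureWindowsPath(file_name).stem
    return datetime.strptime(stem[:8], "%Y%m%d").date().isoformat()
```

The same function works for the exported HTML files, since the export keeps the base name and only changes the extension.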

The program is written in VBScript. MSHTML is used to read the HTML files, and *Msxml2.DOMDocument.6.0* to create the XML document. *Msxml2.SAXXMLReader.6.0* and *Msxml2.MXXMLWriter.6.0* are used to pretty-print the XML document.

Option Explicit
 
Dim htmlin: Set htmlin = CreateObject("htmlfile") 'MSHTML document used to parse each HTML export
 
Dim XMLout: Set XMLout = CreateObject("Msxml2.DOMDocument.6.0")
XMLout.async = False
 
Dim root: Set root = XMLout.createElement("root")
XMLout.appendChild(root)
 
Dim FSO: Set FSO = CreateObject("Scripting.FileSystemObject")
Dim folder: Set folder = FSO.GetFolder("C:\Ford\HTML\")
Dim file
For Each file In folder.Files
    'Load the exported HTML file into the MSHTML document
    Dim oFile: Set oFile = FSO.OpenTextFile(file)
    htmlin.open("about:blank")
    htmlin.write(oFile.ReadAll())
    oFile.Close
        
    'date format is year-month-day
    Dim filedate: filedate = FSO.GetBaseName(file)
    WScript.Echo filedate
    filedate = Mid(filedate,1,4) & "-" & _
                Mid(filedate,5,2) & "-" & Mid(filedate,7,2) 
    
    Dim lastgoodnode

    Dim tables: Set tables = htmlin.getElementsByTagName("Table")
    Dim table
    For Each table In tables
        
        Dim rows: Set rows = table.getElementsByTagName("TR")
        Dim row
        For Each row In rows
            Dim columns: Set columns = row.getElementsByTagName("TD")'Header is TH
            If columns.length = 0 Then 'Header
                'ignore
            ElseIf columns.length <> 9 Then
                WScript.StdOut.WriteLine "Column Error: " & _
                        file.Path & "|" & row.outerHTML 'HTML elements expose outerHTML, not xml
            Else 
                'columns.item(0).text
                '1
                Dim id: id = Trim(columns.item(0).innerText)                    
                Dim rollid: Set rollid= Nothing
                If(id = "") Then 'continue from last node
                
                    Dim subjectproper: subjectproper = LCase(columns.item(3).innerText)
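                    'InStr returns a position (0 = not found), and Not/Or are bitwise on numbers,
                    'so each result is compared with 0 explicitly
                    If InStr(subjectproper,"roll") = 0 And _
                        InStr(subjectproper,"taken on") = 0 And _
                        InStr(subjectproper,"empty folder") = 0 Then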
                        
                    Set rollid = lastgoodnode
                    'WScript.Echo "empty node"
                    lastgoodnode.setAttribute "name", _
                      lastgoodnode.getAttribute("name") & columns.item(1).innerText
                    lastgoodnode.setAttribute "tone", _
                      lastgoodnode.getAttribute("tone") & columns.item(2).innerText
                    lastgoodnode.setAttribute "subjectproper", _
                      lastgoodnode.getAttribute("subjectproper") & columns.item(3).innerText
                    lastgoodnode.setAttribute "subjectgeneric", _
                      lastgoodnode.getAttribute("subjectgeneric") & columns.item(4).innerText
                    lastgoodnode.setAttribute "names", _
                      lastgoodnode.getAttribute("names") & columns.item(5).innerText
                    lastgoodnode.setAttribute "geographic", _
                      lastgoodnode.getAttribute("geographic") & columns.item(6).innerText
                    lastgoodnode.setAttribute "location", _
                      lastgoodnode.getAttribute("location") & columns.item(7).innerText
                    lastgoodnode.setAttribute "photographer", _
                      lastgoodnode.getAttribute("photographer") & columns.item(8).innerText
                    End If
                        
                Else
                    'Look for an existing roll element with this identifier
                    Set rollid = XMLout.SelectSingleNode("/root/roll[@id='" & _
                        id & "']")
                    If rollid Is Nothing Then 'Create new roll if not there
                        'WScript.Echo "id: " & id
                        Set rollid = XMLout.createElement("roll")
                        rollid.setAttribute "id", id
                        root.appendChild(rollid)
                    End If
                        
                    'WScript.Echo "f: " & columns.item(1).innerText
                    Dim frames: Set frames = XMLout.createElement("frames")
                    
                    frames.setAttribute "name", columns.item(1).innerText
                    frames.setAttribute "tone", columns.item(2).innerText
                    frames.setAttribute "subjectproper", columns.item(3).innerText
                    frames.setAttribute "subjectgeneric", columns.item(4).innerText
                    frames.setAttribute "names", columns.item(5).innerText
                    frames.setAttribute "geographic", columns.item(6).innerText
                    frames.setAttribute "location", columns.item(7).innerText
                    frames.setAttribute "photographer", columns.item(8).innerText
                    frames.setAttribute "date", filedate
                    
                    Set lastgoodnode = frames
                    rollid.appendChild(frames)
                End If
            End If
        Next
    Next
Next
 
WScript.Echo "Saving xml file"
 
Dim writer: Set writer = CreateObject("Msxml2.MXXMLWriter.6.0")
With writer
    .indent = True
    .omitXMLDeclaration = False
    .standalone = True
    '.encoding = "utf-8"
End With
Dim reader: set reader = CreateObject("Msxml2.SAXXMLReader.6.0")
With reader
    .contentHandler = writer
    .putProperty "http://xml.org/sax/properties/lexical-handler", writer
    .parse(XMLout)
End With
 
Dim Stream : Set Stream = CreateObject("ADODB.Stream")
With Stream
    .Open
    .WriteText writer.output
    .SaveToFile "C:\Ford\data1.xml", 2 '2 = adSaveCreateOverWrite, so the script can be rerun
    .Close
End With
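
The part of the script that is easiest to miss is the handling of rows whose "Roll #" cell is empty. As read from the code, such a row is treated as a continuation of the previous record, and its cells are appended to that record, unless its "Subject -Proper" cell contains "roll", "taken on", or "empty folder". The following Python sketch restates that stitching logic on plain lists; it is an interpretation of the VBScript, not a port of it, and the names stitch_rows, FIELDS, and NOTE_MARKERS are hypothetical.

```python
FIELDS = ("name", "tone", "subjectproper", "subjectgeneric",
          "names", "geographic", "location", "photographer")
NOTE_MARKERS = ("roll", "taken on", "empty folder")


def stitch_rows(rows):
    """Turn 9-cell table rows into records, merging continuation rows.

    rows: iterable of lists of 9 strings (roll id followed by FIELDS).
    Returns a list of dicts with a 'roll' key plus one key per field.
    """
    records = []
    for cells in rows:
        if len(cells) != 9:
            continue  # the VBScript logs these as "Column Error"
        roll_id = cells[0].strip()
        if roll_id:
            # A normal row starts a new record.
            record = {"roll": roll_id}
            record.update(zip(FIELDS, cells[1:]))
            records.append(record)
        elif records and not any(m in cells[3].lower() for m in NOTE_MARKERS):
            # Continuation row: append each cell to the previous record.
            last = records[-1]
            for field, text in zip(FIELDS, cells[1:]):
                last[field] += text
        # Rows without a roll id that match a marker are skipped.
    return records
```

Unlike the VBScript, the sketch also skips a continuation row that appears before any record exists, instead of failing on an unset previous record.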

The resulting database (about 36 MB) is available on Dropbox.
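
Once the database exists, the metadata for a contact sheet can be looked up by roll identifier. The following Python sketch shows one way to do that with the standard xml.etree.ElementTree module; the function name is hypothetical, and it assumes the file follows the outline given earlier.

```python
import xml.etree.ElementTree as ET


def frames_for_roll(root, roll_id):
    """Return one dict of attributes per <frames> element of the given roll."""
    path = f"./roll[@id='{roll_id}']/frames"
    return [dict(frames.attrib) for frames in root.findall(path)]
```

For a file on disk, the root element can be obtained with ET.parse(path).getroot(), after which frames_for_roll(root, "A0001") returns the rows needed to describe that contact sheet. At about 36 MB, the file should load into memory without special handling on a typical machine.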

To do

Explain the VBScript code further.

History

  • 14 December 2012: Initial release.