Converting PDF Files to Plain Text in a WebMatrix Site

March 15, 2013

CPOL

How to use the PDFBox Java library in an ASP.NET Web Pages project

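Introduction

If you want to let visitors search the documents stored on your site by their content, the first task is to convert the formatted documents to plain text.

If your documents are PDF files and your site runs on the ASP.NET framework, the article Converting PDF to Text in C# lists several solutions you can try.

Following that article (which gave me some useful information), I chose the PDFBox library.

Even though PDFBox is a Java library, you can use it from the .NET Framework with the help of IKVM.NET, an implementation of Java for Mono and the Microsoft .NET Framework.

With IKVM you can build a .NET version of PDFBox. I used an unofficial .NET build based on the official PDFBox 1.7.0 library, hosted at http://pdfbox.lehmi.de/

Because the site I wanted to build is developed with Web Pages, I created a sample site in WebMatrix to try out the process.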

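Code description

The site is built around the PdfFile class, which reads all the available properties from the PDF file's metadata and extracts the file's text with the PDFBox library: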

using System;
using System.Collections.Generic;
using System.Web;
using java.util;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

public class PdfFile
{

    public string Author { get; set; }

    public string Content { get; set; }

    public DateTime Created { get; set; }

    public string Creator { get; set; }

    public string Keywords { get; set; }

    public DateTime Modified { get; set; }

    public string Producer { get; set; }

    public string Subject { get; set; }

    public string Title { get; set; }

    public string Trapped { get; set; }

    public static DateTime CalendarToDateTime(Calendar calendar)
    {
        if (calendar != null)
        {
            int year = calendar.get(Calendar.YEAR);
            // Calendar.MONTH is zero-based (January = 0); DateTime months start at 1
            int month = calendar.get(Calendar.MONTH) + 1;
            int day = calendar.get(Calendar.DAY_OF_MONTH);
            int hour = calendar.get(Calendar.HOUR_OF_DAY);
            int minute = calendar.get(Calendar.MINUTE);
            int second = calendar.get(Calendar.SECOND);
            int millis = calendar.get(Calendar.MILLISECOND);

            var date = new DateTime(year, month, day, hour, minute, second, millis);

            return date;
        }

        else {
            return DateTime.MinValue;
        }
    }
    
    
    public PdfFile(string FilePath)
    {
        // Load the document; PDDocument keeps file resources open until close() is called
        PDDocument PdfDoc = PDDocument.load(FilePath);
        try
        {
            PDDocumentInformation PdfInfo = PdfDoc.getDocumentInformation();

            // Metadata entries are optional, so fall back to an empty string
            Title = (PdfInfo.getTitle() ?? "");
            Subject = (PdfInfo.getSubject() ?? "");
            Author = (PdfInfo.getAuthor() ?? "");
            Creator = (PdfInfo.getCreator() ?? "");
            Producer = (PdfInfo.getProducer() ?? "");
            Keywords = (PdfInfo.getKeywords() ?? "");
            Trapped = (PdfInfo.getTrapped() ?? "");
            Created = CalendarToDateTime(PdfInfo.getCreationDate());
            Modified = CalendarToDateTime(PdfInfo.getModificationDate());

            // Extract the text of the whole document
            PDFTextStripper stripper = new PDFTextStripper();
            Content = stripper.getText(PdfDoc);
        }
        finally
        {
            // Release the underlying file even if extraction throws
            PdfDoc.close();
        }
    }
}
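
CalendarToDateTime rebuilds the date field by field, so the result is expressed in whatever time zone the Calendar carries. As a minimal sketch of an alternative, assuming you read the timestamp with the standard java.util.Calendar method getTimeInMillis() (milliseconds since the Unix epoch, in UTC), the conversion reduces to one addition. The JavaTime class and FromJavaMillis method are hypothetical names, not part of the sample site:

```csharp
using System;

public static class JavaTime
{
    // Unix epoch: the reference point for Java's millisecond timestamps
    private static readonly DateTime Epoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Hypothetical helper: pass it the value of calendar.getTimeInMillis()
    public static DateTime FromJavaMillis(long millis)
    {
        // The result has DateTimeKind.Utc; call ToLocalTime() if local time is wanted
        return Epoch.AddMilliseconds(millis);
    }
}
```

With this approach the null check stays in the caller, for example returning DateTime.MinValue when the Calendar is null, as the original method does.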

A simple home page lets you upload a .pdf file, saves it to the UploadedFiles folder, passes it to the PdfFile class, and saves its content as a .txt file in the Temp folder:

@using Microsoft.Web.Helpers; 

@{
    TimeSpan elapsed = TimeSpan.Zero;
    var fileName = ""; 
    var fileTitle = "";
    var fileSubject = "";
    var fileAuthor = "";
    var fileCreator = "";
    var fileProducer = "";
    var fileKeywords = "";
    DateTime fileCreation = DateTime.MinValue;
    DateTime fileModify = DateTime.MinValue;
    long fileLength = 0;


    if (IsPost){
        var start = DateTime.Now;
        var fileSavePath = ""; 
        var uploadedFile = Request.Files[0]; 
        fileName = Path.GetFileName(uploadedFile.FileName); 
        fileSavePath = Server.MapPath("~/UploadedFiles/" + fileName); 
        uploadedFile.SaveAs(fileSavePath);

        PdfFile file = new PdfFile(fileSavePath);
        fileTitle = file.Title;
        fileSubject = file.Subject;
        fileAuthor = file.Author;
        fileCreator = file.Creator;
        fileProducer = file.Producer;
        fileKeywords = file.Keywords;
        fileCreation = file.Created;
        fileModify = file.Modified;
        fileLength = file.Content.Length;
 
        var destFile = Server.MapPath("~/Temp/Content.txt");
        using (StreamWriter sw = new StreamWriter(destFile)){
            sw.WriteLine(file.Content);
        }
        elapsed = (DateTime.Now - start);
    }   
}

<!DOCTYPE html>

<html lang="en">
    <head>
        <meta charset="utf-8" />
        <title>From PDF to Text</title>
        <link href="~/favicon.ico" rel="shortcut icon" type="image/x-icon" />
        <link href="~/Content/Style.css" rel="stylesheet" type="text/css" />
        <script type="text/javascript">
            function myFunction()
            {
                alert("Hello World!");
            }
        </script>
    </head>
    <body>
        <h2>From PDF to Text</h2>
        <div>
            <form enctype="multipart/form-data" method="post">
                <p><label for="fileUpload">PDF file</label></p>
                @FileUpload.GetHtml( 
                    initialNumberOfFiles:1, 
                    allowMoreFilesToBeAdded:false, 
                    includeFormTag:false, 
                    uploadText:"")
                <div>
                    <input type="submit" name="action" value="Upload" />
                </div>
            </form>
        </div>
        <hr>
        @if(IsPost){
            <div>
                <h3>Uploaded file: @fileName</h3>
                <p>Title: @fileTitle</p>
                <p>Subject: @fileSubject</p>
                <p>Author: @fileAuthor</p>
                <p>Creator: @fileCreator</p>
                <p>Producer: @fileProducer</p>
                <p>Keywords: @fileKeywords</p>
                <p>Created: @fileCreation</p>
                <p>Modified: @fileModify</p>
            </div>
            <hr>
            <div>
                <h3>@fileLength characters extracted in @elapsed</h3>
                @if (fileLength > 0) {
                    var fname = "Content.txt";
                    <input type="button" 
                        onclick="location.href='download.cshtml?filename=/Temp/@fname';" value="Open">
                }
            </div>
        }
    </body>
</html>
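
The page above saves whatever file the browser sends and hands it straight to PdfFile. The following is a minimal sketch, not part of the original sample, of two checks that could run before SaveAs: one that accepts only names ending in .pdf, and one that drops any directory part from the client-supplied name before building the save path. UploadNames, IsPdfName and SafeSavePath are hypothetical names:

```csharp
using System;
using System.IO;

public static class UploadNames
{
    // True when the client-supplied name ends in ".pdf" (case-insensitive)
    public static bool IsPdfName(string clientFileName)
    {
        if (string.IsNullOrEmpty(clientFileName)) return false;
        return string.Equals(Path.GetExtension(clientFileName), ".pdf",
                             StringComparison.OrdinalIgnoreCase);
    }

    // Keeps only the file-name part, so "../../x.pdf" cannot leave the upload folder
    public static string SafeSavePath(string uploadFolder, string clientFileName)
    {
        return Path.Combine(uploadFolder, Path.GetFileName(clientFileName));
    }
}
```

In the page, the idea would be to call UploadNames.IsPdfName(uploadedFile.FileName) first and skip the conversion when it returns false. A file-name check does not prove the content is a valid PDF, so the PdfFile constructor can still throw on a malformed upload.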

At the end of the process, the user can download the text file by requesting a handler page (download.cshtml):

@{ 
    if(!Request["filename"].IsEmpty()){ 
        var filename = Request["filename"]; 
        Functions.DownloadFile(Server.MapPath(filename));
    } 
} 
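
Because the handler maps whatever value arrives in the filename parameter, a request could name any file under the site. The sketch below, which assumes downloads should be limited to a single folder such as Temp, resolves the requested name and rejects anything that ends up outside that folder. DownloadPaths and ResolveInside are hypothetical names, not part of the original sample:

```csharp
using System;
using System.IO;

public static class DownloadPaths
{
    // Returns the full path of requestedName inside rootFolder,
    // or null when the request would resolve outside that folder.
    public static string ResolveInside(string rootFolder, string requestedName)
    {
        if (string.IsNullOrEmpty(requestedName)) return null;

        // Normalize the root so it always ends with exactly one separator
        string root = Path.GetFullPath(rootFolder)
            .TrimEnd(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar)
            + Path.DirectorySeparatorChar;

        // GetFullPath collapses ".." segments before the prefix check
        string full = Path.GetFullPath(Path.Combine(root, requestedName));

        return full.StartsWith(root, StringComparison.Ordinal) ? full : null;
    }
}
```

One way to use it, under the same assumption, is to pass only the file name in the query string and call DownloadPaths.ResolveInside(Server.MapPath("~/Temp"), Request["filename"]), downloading only when the result is not null.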

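Another point of interest

Another small point of interest is the function the handler page uses to download the file. I got it from the DotNetSlackers blog; it is a handy solution for downloading any type of file: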

@functions {
    public static void DownloadFile(string filePath)
    {
        // Create new instance of FileInfo class to get the properties of the file being downloaded
        FileInfo file = new FileInfo(filePath);

        // Checking if file exists
        if (file.Exists)
        {
            // Clear the content of the response
            Response.ClearContent();

            // Add the file name and attachment, which will force the open/cancel/save dialog
            // to show, to the header
            Response.AddHeader("Content-Disposition", "attachment; filename=" + file.Name);

            // Add the file size into the response header
            Response.AddHeader("Content-Length", file.Length.ToString());

            // Set the ContentType
            Response.ContentType = ReturnExtension(file.Extension.ToLower());

            // Write the file into the response
            Response.TransmitFile(file.FullName);

            // End the response
            Response.End();

        }
    }

    private static string ReturnExtension(string fileExtension)
    {
        switch (fileExtension)
        {
            case ".htm":
            case ".html":
            case ".log":
                return "text/HTML";
            case ".txt":
                return "text/plain";
            case ".doc":
                return "application/ms-word";
            case ".tiff":
            case ".tif":
                return "image/tiff";
            case ".asf":
                return "video/x-ms-asf";
            case ".avi":
                return "video/avi";
            case ".zip":
                return "application/zip";
            case ".xls":
            case ".csv":
                return "application/vnd.ms-excel";
            case ".gif":
                return "image/gif";
            case ".jpg":
            case ".jpeg":
                return "image/jpeg";
            case ".bmp":
                return "image/bmp";
            case ".wav":
                return "audio/wav";
            case ".mp3":
                return "audio/mpeg3";
            case ".mpg":
            case ".mpeg":
                return "video/mpeg";
            case ".rtf":
                return "application/rtf";
            case ".asp":
                return "text/asp";
            case ".pdf":
                return "application/pdf";
            case ".fdf":
                return "application/vnd.fdf";
            case ".ppt":
                return "application/mspowerpoint";
            case ".dwg":
                return "image/vnd.dwg";
            case ".msg":
                return "application/msoutlook";
            case ".xml":
            case ".sdxl":
                return "application/xml";
            case ".xdp":
                return "application/vnd.adobe.xdp+xml";
            default:
                return "application/octet-stream";
        }
    }
}
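
The switch in ReturnExtension can also be written as a lookup table, which keeps each extension on one line and makes a missing leading dot easier to spot. This is only a sketch with a handful of entries taken from the function above; MimeTypes and FromExtension are hypothetical names. On .NET 4.5 and later, System.Web.MimeMapping.GetMimeMapping may be a ready-made alternative.

```csharp
using System;
using System.Collections.Generic;

public static class MimeTypes
{
    // Case-insensitive keys, so callers do not need to lower-case the extension
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".txt",  "text/plain" },
            { ".pdf",  "application/pdf" },
            { ".zip",  "application/zip" },
            { ".gif",  "image/gif" },
            { ".jpg",  "image/jpeg" },
            { ".jpeg", "image/jpeg" },
            { ".xml",  "application/xml" }
        };

    // Falls back to a generic binary type for unknown extensions
    public static string FromExtension(string extension)
    {
        string mime;
        if (extension != null && Map.TryGetValue(extension, out mime))
        {
            return mime;
        }
        return "application/octet-stream";
    }
}
```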

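Using the sample

The steps to get the attached sample working are:

  • Download and unzip the Pdf2Text.zip file;
  • Start WebMatrix 2 and, from the "Open Site" menu, choose "Folder As Site";
  • Select the Pdf2TextSite folder and choose "Upgrade to newer version" in the dialog that follows;
  • Download pdfbox-1.7.0-dlls.zip from http://pdfbox.lehmi.de/ and copy these files from it into your new site's bin folder: commons-logging.dll, fontbox-1.7.0.dll, pdfbox-1.7.0.dll;
  • Download ikvmbin-7.2.4630.5.zip from http://sourceforge.net/projects/ikvm/files/ and copy the following files from its bin folder into your site's bin folder: IKVM.OpenJDK.Core.dll, IKVM.OpenJDK.SwingAWT.dll, IKVM.OpenJDK.Text.dll, IKVM.OpenJDK.Util.dll, IKVM.Runtime.dll;
  • In WebMatrix 2, install the ASP.NET Web Helpers Library from the NuGet gallery.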
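
If the site fails to start after these steps, a missing assembly in the bin folder is a likely cause. The following sketch, not part of the original sample, lists which of the DLLs named above are absent from a given folder. BinCheck and Missing are hypothetical names:

```csharp
using System.Collections.Generic;
using System.IO;

public static class BinCheck
{
    // The assemblies the steps above ask you to copy into bin
    public static readonly string[] RequiredDlls =
    {
        "commons-logging.dll",
        "fontbox-1.7.0.dll",
        "pdfbox-1.7.0.dll",
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "IKVM.OpenJDK.Text.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names of required files that are not present in binFolder
    public static List<string> Missing(string binFolder)
    {
        var missing = new List<string>();
        foreach (string name in RequiredDlls)
        {
            if (!File.Exists(Path.Combine(binFolder, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

From a page in the site this could be called as BinCheck.Missing(Server.MapPath("~/bin")); an empty list means every file from the two downloads is in place.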
