Converting PDF Files to Plain Text in a WebMatrix Site






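How to use the PDFBox Java library in an ASP.NET Web Pages project
Introduction
If you want to let visitors search the documents stored on your website by their content, the first task you must complete is converting the formatted documents to plain text.
If your documents are PDF files and your site runs on the ASP.NET framework, you have several options: the article "Converting PDF to Text in C#" lists some solutions you can try.
Following that article (from which I took some useful information), I chose the PDFBox library.
Although PDFBox is a Java library, you can use it from the .NET Framework with the help of IKVM.NET, an implementation of Java for Mono and the Microsoft .NET Framework.
IKVM makes it possible to build a .NET version of PDFBox: I used an unofficial .NET build based on the official PDFBox 1.7.0 library, hosted at http://pdfbox.lehmi.de/.
Since the site I wanted to build is developed with Web Pages, I created a sample site in WebMatrix to try out the process.
Code Description
The site is built around the PdfFile class, which reads all the available properties from the PDF file's metadata and uses the PDFBox library to extract the file's text: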
using System;
using System.Collections.Generic;
using System.Web;
using java.util;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

public class PdfFile
{
    public string Author { get; set; }
    public string Content { get; set; }
    public DateTime Created { get; set; }
    public string Creator { get; set; }
    public string Keywords { get; set; }
    public DateTime Modified { get; set; }
    public string Producer { get; set; }
    public string Subject { get; set; }
    public string Title { get; set; }
    public string Trapped { get; set; }

    // Converts a java.util.Calendar to a .NET DateTime.
    // Returns DateTime.MinValue when the PDF does not carry the date.
    public static DateTime CalendarToDateTime(Calendar calendar)
    {
        if (calendar == null)
        {
            return DateTime.MinValue;
        }
        int year = calendar.get(Calendar.YEAR);
        // Calendar.MONTH is zero-based, DateTime months are one-based.
        int month = calendar.get(Calendar.MONTH) + 1;
        int day = calendar.get(Calendar.DAY_OF_MONTH);
        int hour = calendar.get(Calendar.HOUR_OF_DAY);
        int minute = calendar.get(Calendar.MINUTE);
        int second = calendar.get(Calendar.SECOND);
        int millis = calendar.get(Calendar.MILLISECOND);
        return new DateTime(year, month, day, hour, minute, second, millis);
    }

    public PdfFile(string FilePath)
    {
        PDDocument PdfDoc = PDDocument.load(FilePath);
        try
        {
            // Read the metadata, replacing missing values with empty strings.
            PDDocumentInformation PdfInfo = PdfDoc.getDocumentInformation();
            Title = (PdfInfo.getTitle() ?? "");
            Subject = (PdfInfo.getSubject() ?? "");
            Author = (PdfInfo.getAuthor() ?? "");
            Creator = (PdfInfo.getCreator() ?? "");
            Producer = (PdfInfo.getProducer() ?? "");
            Keywords = (PdfInfo.getKeywords() ?? "");
            Trapped = (PdfInfo.getTrapped() ?? "");
            Created = CalendarToDateTime(PdfInfo.getCreationDate());
            Modified = CalendarToDateTime(PdfInfo.getModificationDate());

            // Extract the text of the whole document.
            PDFTextStripper stripper = new PDFTextStripper();
            Content = stripper.getText(PdfDoc);
        }
        finally
        {
            // Release the document so the uploaded file is not left locked.
            PdfDoc.close();
        }
    }
}
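CalendarToDateTime copies the calendar fields one by one, so the resulting DateTime is expressed in whatever time zone the Calendar happens to carry. As a possible alternative, the following minimal sketch converts the epoch milliseconds that java.util.Calendar exposes through getTimeInMillis() and always yields a UTC value. The JavaTime class and the FromEpochMillis name are hypothetical and not part of the sample site.

```csharp
using System;

// Hypothetical helper, not part of the sample site.
public static class JavaTime
{
    // Java timestamps count milliseconds from 1970-01-01T00:00:00Z.
    private static readonly DateTime UnixEpoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Converts epoch milliseconds (for example the value returned by
    // calendar.getTimeInMillis()) to a UTC DateTime.
    public static DateTime FromEpochMillis(long millis)
    {
        return UnixEpoch.AddMilliseconds(millis);
    }
}
```

Inside PdfFile it could be used, after a null check on the calendar, as JavaTime.FromEpochMillis(calendar.getTimeInMillis()). Whether UTC or the document's local time is preferable depends on how the dates are displayed.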
A simple home page lets you upload a .pdf file, saves it to the UploadedFiles folder, passes it to the PdfFile class, and writes the extracted content to a .txt file in the Temp folder:
@using Microsoft.Web.Helpers;
@{
    TimeSpan elapsed = TimeSpan.Zero;
    var fileName = "";
    var fileTitle = "";
    var fileSubject = "";
    var fileAuthor = "";
    var fileCreator = "";
    var fileProducer = "";
    var fileKeywords = "";
    DateTime fileCreation = DateTime.MinValue;
    DateTime fileModify = DateTime.MinValue;
    long fileLength = 0;
    if (IsPost){
        var start = DateTime.Now;
        var fileSavePath = "";
        // Save the uploaded file under UploadedFiles
        var uploadedFile = Request.Files[0];
        fileName = Path.GetFileName(uploadedFile.FileName);
        fileSavePath = Server.MapPath("~/UploadedFiles/" + fileName);
        uploadedFile.SaveAs(fileSavePath);
        // Read the metadata and extract the text
        PdfFile file = new PdfFile(fileSavePath);
        fileTitle = file.Title;
        fileSubject = file.Subject;
        fileAuthor = file.Author;
        fileCreator = file.Creator;
        fileProducer = file.Producer;
        fileKeywords = file.Keywords;
        fileCreation = file.Created;
        fileModify = file.Modified;
        fileLength = file.Content.Length;
        // Write the extracted text to Temp/Content.txt
        var destFile = Server.MapPath("~/Temp/Content.txt");
        using (StreamWriter sw = new StreamWriter(destFile)){
            sw.WriteLine(file.Content);
        }
        elapsed = (DateTime.Now - start);
    }
}
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8" />
    <title>From PDF to Text</title>
    <link href="~/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <link href="~/Content/Style.css" rel="stylesheet" type="text/css" />
    <script type="text/javascript">
        function myFunction()
        {
            alert("Hello World!");
        }
    </script>
</head>
<body>
    <h2>From PDF to Text</h2>
    <div>
        <form enctype="multipart/form-data" method="post">
            <p><label for="fileUpload">PDF file</label></p>
            @FileUpload.GetHtml(
                initialNumberOfFiles:1,
                allowMoreFilesToBeAdded:false,
                includeFormTag:false,
                uploadText:"")
            <div>
                <input type="submit" name="action" value="Upload" />
            </div>
        </form>
    </div>
    <hr>
    @if(IsPost){
        <div>
            <h3>Uploaded file: @fileName</h3>
            <p>Title: @fileTitle</p>
            <p>Subject: @fileSubject</p>
            <p>Author: @fileAuthor</p>
            <p>Creator: @fileCreator</p>
            <p>Producer: @fileProducer</p>
            <p>Keywords: @fileKeywords</p>
            <p>Created: @fileCreation</p>
            <p>Modified: @fileModify</p>
        </div>
        <hr>
        <div>
            <h3>@fileLength characters extracted in @elapsed</h3>
            @if (fileLength > 0) {
                var fname = "Content.txt";
                <input type="button"
                    onclick="location.href='download.cshtml?filename=/Temp/@fname';" value="Open">
            }
        </div>
    }
</body>
</html>
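The page above passes whatever was uploaded straight to PdfFile. A site open to the public would probably want to check the upload first; the sketch below shows one way to do that, under the assumption that a .pdf extension plus the "%PDF-" signature at the start of the file is a good enough test. The UploadChecks class and both method names are hypothetical.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the sample site.
public static class UploadChecks
{
    // Returns the bare file name when it ends in .pdf (any casing),
    // otherwise null.
    public static string SafePdfName(string clientFileName)
    {
        if (string.IsNullOrEmpty(clientFileName))
        {
            return null;
        }
        string name = Path.GetFileName(clientFileName);
        string extension = Path.GetExtension(name);
        bool isPdf = string.Equals(extension, ".pdf",
            StringComparison.OrdinalIgnoreCase);
        return isPdf ? name : null;
    }

    // Returns true when the stream starts with the bytes of "%PDF-".
    public static bool HasPdfHeader(Stream input)
    {
        byte[] magic = { 0x25, 0x50, 0x44, 0x46, 0x2D }; // "%PDF-"
        byte[] buffer = new byte[magic.Length];
        int read = 0;
        while (read < buffer.Length)
        {
            int n = input.Read(buffer, read, buffer.Length - read);
            if (n == 0)
            {
                return false; // stream ended before five bytes were read
            }
            read += n;
        }
        for (int i = 0; i < magic.Length; i++)
        {
            if (buffer[i] != magic[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

In the page, SafePdfName(uploadedFile.FileName) could replace the plain Path.GetFileName call, with the upload rejected when it returns null. HasPdfHeader reads from the stream, so the caller would have to rewind it (or reopen the saved file) before further use.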
At the end of the process, the user can download the text file by requesting a handler page (download.cshtml):
@{
    if(!Request["filename"].IsEmpty()){
        var filename = Request["filename"];
        Functions.DownloadFile(Server.MapPath(filename));
    }
}
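Note that the handler maps any path it receives in the filename parameter, so a crafted request could ask for files outside the Temp folder. The following is a minimal sketch of a guard, assuming downloads should be limited to a single folder; the DownloadGuard class and the IsInsideFolder name are hypothetical.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the sample site.
public static class DownloadGuard
{
    // Returns true when candidatePath, once ".." segments are resolved,
    // still lies inside rootFolder.
    public static bool IsInsideFolder(string rootFolder, string candidatePath)
    {
        // Normalize the root and make sure it ends with a separator, so
        // that "C:\site\Temp2" is not mistaken for a child of "C:\site\Temp".
        string root = Path.GetFullPath(rootFolder)
            .TrimEnd(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar)
            + Path.DirectorySeparatorChar;
        string full = Path.GetFullPath(candidatePath);
        return full.StartsWith(root, StringComparison.OrdinalIgnoreCase);
    }
}
```

In download.cshtml the check could wrap the call to DownloadFile, for example by comparing Server.MapPath(filename) against Server.MapPath("~/Temp") and skipping the download when the test fails.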
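Another Point of Interest
Another small point of interest is the function that the handler page uses to download the file: I took it from the DotNetSlackers blog, and it is a handy solution for downloading any type of file: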
@functions {
    public static void DownloadFile(string filePath)
    {
        // Create new instance of FileInfo class to get the properties of the file being downloaded
        FileInfo file = new FileInfo(filePath);
        // Checking if file exists
        if (file.Exists)
        {
            // Clear the content of the response
            Response.ClearContent();
            // Add the file name and attachment, which will force the open/cancel/save dialog
            // to show, to the header
            Response.AddHeader("Content-Disposition", "attachment; filename=" + file.Name);
            // Add the file size into the response header
            Response.AddHeader("Content-Length", file.Length.ToString());
            // Set the ContentType
            Response.ContentType = ReturnExtension(file.Extension.ToLower());
            // Write the file into the response
            Response.TransmitFile(file.FullName);
            // End the response
            Response.End();
        }
    }

    // Maps a lower-case file extension (including the leading dot) to a MIME type
    private static string ReturnExtension(string fileExtension)
    {
        switch (fileExtension)
        {
            case ".htm":
            case ".html":
            case ".log":
                return "text/HTML";
            case ".txt":
                return "text/plain";
            case ".doc":
                return "application/ms-word";
            case ".tiff":
            case ".tif":
                return "image/tiff";
            case ".asf":
                return "video/x-ms-asf";
            case ".avi":
                return "video/avi";
            case ".zip":
                return "application/zip";
            case ".xls":
            case ".csv":
                return "application/vnd.ms-excel";
            case ".gif":
                return "image/gif";
            case ".jpg":
            case ".jpeg":
                return "image/jpeg";
            case ".bmp":
                return "image/bmp";
            case ".wav":
                return "audio/wav";
            case ".mp3":
                return "audio/mpeg3";
            case ".mpg":
            case ".mpeg":
                return "video/mpeg";
            case ".rtf":
                return "application/rtf";
            case ".asp":
                return "text/asp";
            case ".pdf":
                return "application/pdf";
            case ".fdf":
                return "application/vnd.fdf";
            case ".ppt":
                return "application/mspowerpoint";
            case ".dwg":
                return "image/vnd.dwg";
            case ".msg":
                return "application/msoutlook";
            case ".xml":
            case ".sdxl":
                return "application/xml";
            case ".xdp":
                return "application/vnd.adobe.xdp+xml";
            default:
                return "application/octet-stream";
        }
    }
}
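The long switch statement works, but the same mapping can also be kept in a lookup table, which makes it harder to mistype a case label and removes the need to lower-case the extension first. This is only a sketch of that alternative with a handful of entries; the MimeTypes class and the FromExtension name are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the sample site.
public static class MimeTypes
{
    // Keys include the leading dot and are compared case-insensitively.
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".txt", "text/plain" },
            { ".pdf", "application/pdf" },
            { ".zip", "application/zip" },
            { ".gif", "image/gif" },
            { ".jpg", "image/jpeg" },
            { ".jpeg", "image/jpeg" },
            { ".xml", "application/xml" }
        };

    // Returns the MIME type for an extension such as ".pdf", falling
    // back to a generic binary type for unknown or missing extensions.
    public static string FromExtension(string extension)
    {
        string mime;
        if (extension != null && Map.TryGetValue(extension, out mime))
        {
            return mime;
        }
        return "application/octet-stream";
    }
}
```

With such a table, DownloadFile could set Response.ContentType to MimeTypes.FromExtension(file.Extension) directly.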
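Using the Sample
To get the attached sample working, follow these steps:
- download and unzip the Pdf2Text.zip file;
- start WebMatrix 2 and, from the "Open Site" menu, choose "Folder As Site";
- select the Pdf2TextSite folder and choose "Upgrade to newer version" in the dialog that follows;
- download pdfbox-1.7.0-dlls.zip from http://pdfbox.lehmi.de/ and copy commons-logging.dll, fontbox-1.7.0.dll, and pdfbox-1.7.0.dll from it into the bin folder of your new site;
- download ikvmbin-7.2.4630.5.zip from http://sourceforge.net/projects/ikvm/files/ and copy the following files from its bin folder into your site's bin folder: IKVM.OpenJDK.Core.dll, IKVM.OpenJDK.SwingAWT.dll, IKVM.OpenJDK.Text.dll, IKVM.OpenJDK.Util.dll, and IKVM.Runtime.dll;
- in WebMatrix 2, install the ASP.NET Web Helpers Library from the NuGet gallery.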
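Since the setup depends on eight DLLs copied by hand, a missing file is an easy mistake to make. The sketch below lists which of the files named in the steps are absent from a given bin folder; the BinCheck class and the MissingFiles name are hypothetical, and the file list is taken from the steps as written.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the sample site.
public static class BinCheck
{
    // File names taken from the setup steps.
    public static readonly string[] RequiredFiles =
    {
        "commons-logging.dll",
        "fontbox-1.7.0.dll",
        "pdfbox-1.7.0.dll",
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "IKVM.OpenJDK.Text.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the required files that are not present in binFolder.
    public static List<string> MissingFiles(string binFolder)
    {
        var missing = new List<string>();
        foreach (string name in RequiredFiles)
        {
            if (!File.Exists(Path.Combine(binFolder, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

Within the site it could be called as BinCheck.MissingFiles(Server.MapPath("~/bin")) from a diagnostic page, showing the returned names when the list is not empty.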