
A Simple but Powerful Library for Handling Web Robot Control Policies

bluecurve01


6 Mar 2014

MIT


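How to parse robots.txt and robots meta tags

Introduction

In this tip, I present my library, WWW RobotRules (https://robotrules.codeplex.com/), a simple library for parsing robots.txt files and robots meta tags. The library is fully compliant with RFC 1808 and RFC 1945.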

Using the Code

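Configuration

RobotRulesUseCache: Boolean that enables or disables cache support
RobotRulesCacheLibrary: type definition string (type name, assembly name) of the cache implementation; optional if RobotRulesUseCache is False
RobotRulesCacheTimeout: cache timeout, written as a time span such as 00:01:00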

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="RobotRulesUseCache" value="False"/>
    <add key="RobotRulesCacheLibrary" 
    value="RobotRules.Cache.MemoryCache, RobotRules"/>
    <add key="RobotRulesCacheTimeout" value="00:01:00" />
  </appSettings>
</configuration>
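
To make the three keys more concrete, here is a minimal sketch of how an application could read them. It is not the library's own loading code: the RobotRulesSettings class, its From method, and the one-minute fallback timeout are assumptions made for illustration. Only the key names come from the configuration above.

```csharp
using System;
using System.Collections.Specialized;

// Hypothetical holder for the three appSettings values (not part of RobotRules).
public sealed class RobotRulesSettings
{
    public bool UseCache { get; private set; }
    public string CacheLibrary { get; private set; }
    public TimeSpan CacheTimeout { get; private set; }

    // Accepts any key/value collection, for example ConfigurationManager.AppSettings.
    public static RobotRulesSettings From(NameValueCollection appSettings)
    {
        // A missing or unparsable value leaves useCache at false.
        bool useCache;
        bool.TryParse(appSettings["RobotRulesUseCache"], out useCache);

        // Assumed fallback of one minute; the article does not document a default.
        TimeSpan timeout;
        if (!TimeSpan.TryParse(appSettings["RobotRulesCacheTimeout"], out timeout))
            timeout = TimeSpan.FromMinutes(1);

        return new RobotRulesSettings
        {
            UseCache = useCache,
            CacheLibrary = appSettings["RobotRulesCacheLibrary"], // may be null
            CacheTimeout = timeout
        };
    }
}
```

In a real application, passing System.Configuration.ConfigurationManager.AppSettings to From would read the values from the file shown above.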

Using the Library

First, define a new parser with your robot's user agent

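using RobotRules;

// Declare the parser as a field of your crawler class.
private RobotsFileParser RobotRules = new RobotsFileParser()
{
    // No semicolon inside an object initializer.
    LocalUserAgent = @"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
};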

Then use it like this

RobotRules.Parse(new Uri("http://blablabla.com"));
if (RobotRules.IsAllowed("GoogleBot", new Uri ("http://blablabla.com"))) {
   // your code ...
}
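
For readers unfamiliar with what such a check involves, the sketch below shows the basic idea of robots.txt matching: collect the Disallow prefixes of the group that names the agent (falling back to the * group) and test the URL path against them. It is a simplified illustration, not the RobotRules implementation; the RobotsTxtSketch class and its methods are hypothetical names, and Allow lines, wildcards, and longest-match rules are deliberately ignored.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified robots.txt matcher (not the RobotRules implementation).
public static class RobotsTxtSketch
{
    // Returns the Disallow prefixes that apply to the given user-agent token.
    public static List<string> DisallowedPrefixes(string robotsTxt, string agentToken)
    {
        var specific = new List<string>(); // rules from groups naming the agent
        var wildcard = new List<string>(); // rules from "User-agent: *" groups
        bool matchesAgent = false, matchesStar = false, inRules = false, sawSpecific = false;

        foreach (var rawLine in robotsTxt.Split('\n'))
        {
            // Strip comments and surrounding whitespace (including '\r').
            string line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);
            line = line.Trim();

            int colon = line.IndexOf(':');
            if (colon < 0) continue;
            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                // A User-agent line that follows rule lines starts a new group.
                if (inRules) { matchesAgent = false; matchesStar = false; inRules = false; }
                if (value == "*") matchesStar = true;
                else if (value.Length > 0 &&
                         agentToken.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0)
                { matchesAgent = true; sawSpecific = true; }
            }
            else if (field == "disallow")
            {
                inRules = true;
                if (value.Length == 0) continue; // empty Disallow allows everything
                if (matchesAgent) specific.Add(value);
                else if (matchesStar) wildcard.Add(value);
            }
            else if (field == "allow")
            {
                inRules = true; // Allow rules are ignored in this sketch
            }
        }
        // A group that names the agent takes precedence over the * group.
        return sawSpecific ? specific : wildcard;
    }

    public static bool IsAllowed(string robotsTxt, string agentToken, Uri url)
    {
        foreach (var prefix in DisallowedPrefixes(robotsTxt, agentToken))
            if (url.AbsolutePath.StartsWith(prefix, StringComparison.Ordinal)) return false;
        return true;
    }
}
```

A production parser also has to fetch the file, handle HTTP errors, and cache results, which is the work the library takes off your hands.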

This code works well, but what if the robot control rules are embedded in the HTML itself?

Example

<!DOCTYPE html>
 
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title>Test</title>
    <meta name="robots" content="nofollow"/>
</head>
<body>
 
</body>
</html>

No problem; just use the library like this

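RobotsFileParser RobotRules = new RobotsFileParser()
{
    // Keep the user agent on one line: a verbatim string would embed the line break.
    LocalUserAgent = @"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
};

// Replace "HTML CONTENT" with the markup of the downloaded page.
RobotControlStrategy strategy =
    RobotRules.CheckRobotControlStrategy("Googlebot", "HTML CONTENT");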

if (strategy.CanFollow)
{
    // your code
}
if (strategy.CanIndex)
{
    // your code
}
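
To illustrate what a robots meta tag check boils down to, here is a minimal sketch that extracts the content attribute with a regular expression and maps the noindex, nofollow, and none directives to two flags. It is not how RobotRules is implemented; MetaRobotsResult and MetaRobotsSketch are hypothetical names, and the pattern assumes the name attribute comes directly before content, which real pages do not guarantee.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical result type (distinct from the library's RobotControlStrategy).
public sealed class MetaRobotsResult
{
    public bool CanIndex = true;
    public bool CanFollow = true;
}

public static class MetaRobotsSketch
{
    // Matches <meta name="robots" content="..."> with name directly before content.
    private static readonly Regex MetaTag = new Regex(
        "<meta\\s+name\\s*=\\s*[\"']robots[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
        RegexOptions.IgnoreCase);

    public static MetaRobotsResult Check(string html)
    {
        var result = new MetaRobotsResult();
        foreach (Match match in MetaTag.Matches(html))
        {
            // The content attribute is a comma-separated list of directives.
            foreach (var raw in match.Groups[1].Value.Split(','))
            {
                switch (raw.Trim().ToLowerInvariant())
                {
                    case "noindex": result.CanIndex = false; break;
                    case "nofollow": result.CanFollow = false; break;
                    case "none": result.CanIndex = false; result.CanFollow = false; break;
                }
            }
        }
        return result;
    }
}
```

For the example page above, this sketch would leave CanIndex true and set CanFollow to false. An HTML parser is a more robust choice than a regular expression once attribute order and malformed markup matter.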

Points of Interest

Use MEF to load the cache plugin instead of reflection

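History

V1: 6 Mar 2014
V1.5.2.4
- ICache now inherits from IDisposable
- Fixed cache initialization
- RobotsFileParser is now disposable
- RobotsFileParser exposes the method ClearCache()
- Added a new configuration key, RobotRulesCacheTimeout, to specify the cache timeout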