实现一个永不耗尽内存的 std::map 替换方案以及生成 ARPA 兼容语言模型来测试实现的说明





5.00/5 (10投票s)
一篇关于改进 STL 容器以支持磁盘缓存,从而消除内存限制问题的文章。

引言
尽管 STL 容器的实用性毋庸置疑,但它们也存在固有的局限性。我本人在语音识别行业摸爬滚打,经常需要处理海量数据,STL 容器最明显的局限性在于所有数据都需要驻留在内存中。
在听写语音识别软件的上下文中,大多数语音识别引擎都需要 ARPA 兼容的语言模型来完成其工作。试想一下,从数亿兆字节的语料库数据构建语言模型,并在过程中确定不仅仅是单词,甚至是单词序列的概率。仅使用非托管的 STL 容器,这样的任务几乎是不可能完成的。
在本文中,我将揭示解决该问题的方法,并借此机会向您介绍语音识别(或至少其中的一部分)的奇妙世界,通过代码讲解如何从数亿兆字节的语料库数据生成 ARPA 兼容的语言模型,而且永不耗尽内存。
背景
ARPA 兼容的语言模型已经存在了相当长一段时间,并且在可预见的未来不会消失。它们是语音识别引擎如何偏置其语音输入并从而识别文本数据的首要手段。虽然通用的英语语言模型显然有其用途,但近年来,趋势已转向生成更专业的语言模型。例如,放射科语言模型可以在医疗领域使用,其中医生们使用的历史语料库数据将被用来生成高度专业的模型,从而最大限度地减少误识别,因为词汇将受到更多限制。
这里有一个实际的语言模型供我们参考。它来自以下文本语料库:
This is a test
This is a second test
在固定折扣质量为 0.4(稍后会详细介绍)的情况下,从该有限文本语料库生成的 trigram(限制为 3 个单词的序列)语言模型如下所示:
\data\
ngram 1=5
ngram 2=5
ngram 3=4
\1-grams:
-0.8751 This -0.3358
-0.8751 a -0.3010
-0.8751 is -0.3358
-1.1761 second -0.3358
-0.8751 test -0.3979
\2-grams:
-0.2218 This is 0.0000
-0.5229 a second 0.0000
-0.5229 a test. -0.3979
-0.2218 is a 0.0000
-0.2218 second test -0.3979
\3-grams:
-0.2218 This is a
-0.2218 a second test
-0.5229 is a second
-0.5229 is a test
\end\
每个单词序列的左侧是概率,右侧是回退概率(3-grams 除外)。工作原理如下。例如,我们来看这一行 `"-0.8751 This -0.3358"`。
让我们从简单的开始。语料库中单词“this”的概率是 2/9(语料库共有 9 个单词,其中“this”出现 2 次),乘以 0.6(1 - 0.4,未折扣质量)。ARPA 兼容语言模型所需的数字实际上是该计算的对数。因此,log(2/9 * (1 - 0.4)) = -0.8751。
关于折扣和回退概率需要稍作说明。回退概率与未终止的语音有关。例如,如果我说话,可能会说“this is a”,然后继续说“test”。没有回退概率,语音识别引擎将无法识别未终止的语音,因为它没有与之关联的概率。出于这个简单原因,并且由于概率总和必须为一,我们取一部分概率质量(在本例中为 0.4 的折扣),并将其保留给回退概率。由于模型是 trigram,您会理解 3-grams 没有与之关联的回退概率。
因此,this 的回退概率计算如下。首先计算以 N-Gram 单词开头的 (N+1)-gram 单词序列的 N-gram 概率总和,然后除以折扣质量并取对数。对于“this”的情况,这意味着 log(0.4/(1-0.13332)),因为以“this”开头的唯一 2-gram 序列是“this is”,而“is”的概率是 0.133332(10 的 -0.8751 次幂)。
在计算 N>1 的 N-grams 时,不是将单词序列的数量除以单词序列的总数,而是将该数量除以以相同单词序列开头的 (N-1)-grams 的数量。例如,与“this is”相关的概率是 log(2/2 * (1 - 0.4)) = -0.2218,因为单词“this”出现了 2 次,而序列“this is”也出现了 2 次。
这些计算并不复杂,但是,由于生成概率的计数之间存在许多依赖关系,而回退概率又需要预先确定其他相应概率,因此在所有数据都组织好并到位之前,是不可能完成计算的。生成语言模型的问题更多是一个索引问题。也就是说,算法在开始任何计算之前,都需要跟踪许多单词和单词序列的计数。我将在此展示如何使用 STL 容器来完成这项工作,而无需担心耗尽内存,即使可以处理数亿兆字节的数据。
Using the Code
步骤 1(可选):垃圾回收
为了开始我们的任务,我们将首先创建一个垃圾回收模板。虽然我们也可以在没有它的情况下实现目标,但我认为创建一个垃圾回收逻辑并在此基础上构建比到处混杂 delete 更清晰。
`std::auto_ptr` 模板几乎是完美的垃圾回收方案,但它缺少复制构造函数和赋值运算符。因此,我建议使用 `shared_auto_ptr`。我创建了一个 `contained_auto_ptr` 模板类,其目的是保存由 `shared_auto_ptr` 对象引用的唯一指针地址。该模板类是 `private` 的,只有 `shared_auto_ptr` 应该使用它。它的复制构造函数和赋值运算符是 `private` 的,以确保它们不会被其他人误用。`shared_auto_ptr` 由一个指向 `contained_auto_ptr` 的指针组成,该指针持有对其的引用计数。`shared_auto_ptr` 的构造函数、析构函数、复制构造函数和赋值运算符都已编写,以便相应地维护相关 `contained_auto_ptr` 对象的计数,一旦计数降至零,`contained_auto_ptr` 对象以及与之关联的内存都会被删除。只要用户严格遵守不将指针用于代码中的其他任何地方,这是一种仅有少量开销的防弹垃圾回收方法。
/* *************************** interfaces ************************** */
// - contained_auto_ptr:
// The contained_auto_ptr is the container of a unique ptr that holds a count
// of the referants.
// It hides the copy constructor and assignment operator in its private section
// in order to ensure that it is never constructed or assigned to another object
// without going through the shared_auto_ptr.
// There are no public method in this class since it is expected to exclusively be
// called by its friend class shared_auto_ptr<T>.
template <class T> class contained_auto_ptr
{
public:
friend class shared_auto_ptr<T>;
protected:
// The GetContainer acts as a static constructor, but constructs only if no
// other contained_auto_ptr<T> points to the same pointer p.
static contained_auto_ptr<T> *GetContainer(T* p = NULL) throw();
// In order to extract the value pointed by the object, use the get() method.
T *get() const throw();
// Destructor...
~contained_auto_ptr() throw();
private:
// Constructor: never call directly. Use GetContainer instead...
contained_auto_ptr(T* p = NULL) throw();
// This static map holds the equivalence between pointers and
// contained_auto_ptr objects already created.
static map<uintptr_t, contained_auto_ptr<T>*> m_equivalenceMap;
// We hide the copy constructor and the assignment operators to ensure they
// are never used.
contained_auto_ptr(const contained_auto_ptr&) throw();
void operator=(const contained_auto_ptr&) throw();
int m_count; // holds the count of the referants.
T *_ptr; // T* being held by the object.
};
// - shared_auto_ptr (*** PUBLIC INTERFACE ***):
// The shared_auto_ptr allows for the creation of a referant to a location in memory.
// It may create a contained_auto_ptr object in the event that it is the first referant
// of the memory location. Upon the destruction of the last referant, the
// memory allocated will be deleted.
template <class T> class shared_auto_ptr
{
public:
// Constructor:
// REQUIREMENTS:
// - p must point to the memory location to track.
// PROMISES:
// - upon destruction of the shared_auto_ptr object, p will be deleted only
// when the last referant will have been deleted.
shared_auto_ptr(T* p = NULL) throw();
// Destructor:
// REQUIREMENTS:
// - none.
// PROMISES:
// - upon destruction of the shared_auto_ptr object, the pointer provided
// upon construction or obtained through the assignment operator will be
// deleted only when the last referant will have been deleted.
~shared_auto_ptr() throw();
// Copy constructor:
// REQUIREMENTS:
// - none.
// PROMISES:
// - upon destruction of the shared_auto_ptr object, the pointer provided
// upon construction or obtained through the assignment operator will be
// deleted only when the last referant will have been deleted.
shared_auto_ptr(const shared_auto_ptr&) throw();
// assignment operator:
// REQUIREMENTS:
// - none.
// PROMISES:
// - upon destruction of the shared_auto_ptr object, the pointer provided
// upon construction or obtained through the assignment operator will
// be deleted only when the last referant will have been deleted.
void operator=(const shared_auto_ptr&) throw();
// get():
// REQUIREMENTS:
// - Do not use the returned pointer outside the scope where it was
// obtained (as when that scope is left, the shared_auto_ptr object may
// be deleted, and the memory associated with
// it also consequently).
// PROMISES:
// - Within the same scope, the returned pointer will be good.
T *get() const throw();
// -> operator:
// REQUIREMENTS:
// - Do not use the returned pointer outside the scope where it was
// obtained (as when that scope is left, the shared_auto_ptr object
// may be deleted, and the memory associated with it also consequently).
// PROMISES:
// - Within the same scope, the returned pointer will be good.
T *operator->() const throw() { return get(); }
private:
contained_auto_ptr<T> *m_contained_auto_ptr;
};
通过包含的代码中此模板的实现,我现在可以抽象我与指针的交互,而无需担心 delete。这使得代码更简洁、更可靠。
步骤 2:一种索引策略
通常,如果我要抽象我的内存问题,`std::map<string, struct>` 将是解决我的索引问题的理想选择。但是,考虑到我的内存问题,`std::map` 模板变得无用。它需要被另一种结构取代,该结构具有与 `std::map` 模板相同的特性,但在内存管理方面更具通用性。
为此,我选择了一个三向决策树。当填充了 2 个条目时,决策树的示例如下:
IndexStructure<int> tree;
tree["this"] = 3;
tree["test"] = 2;
t(NULL)
/|\
/ | \
/ | \
/ | \
/ | \
/ | \
/ h(NULL)\
NULL /|\ NULL
/ | \
/ | \
/ | \
/ | \
/ | \
e(NULL)| NULL
| is(3)
|
st(2)
决策树有一个头部节点,每个节点由一个小于节点、一个等于节点和一个大于节点的引用组成。索引策略本身并不具革命性,但它能完成工作。
// - IndexStructureNode:
// An IndexStructure<T> (public interface available)
// holds nodes of type IndexStructureNode<T> (no
// public interface). IndexStructureNode should not be used
// outside the context of a IndexStructure,
// and as a consequence to that, no public interfaces defined.
template <class T> class IndexStructureNode
{
public:
friend class IndexStructure<T>;
friend class IndexStructureIterator<T>;
~IndexStructureNode() throw();
protected:
IndexStructureNode(const IndexStructureNode<T> ©) throw();
IndexStructureNode<T> &operator=(const IndexStructureNode<T> ©) throw();
unsigned long GetNodeId() const throw() { return m_nodeId; }
T *get() const throw() { return m_data.get(); }
IndexStructureNode(IndexStructure<T> &owner, bool assignId = true) throw();
static IndexStructureNode<T> *ReadNode(IndexStructure<T> *owner,
fstream &dStream, unsigned long dNodeId) throw(fstream::failure);
void ReadNodeData(fstream &dStream) throw(fstream::failure);
void WriteNode(fstream &dStream) throw(fstream::failure);
void WriteNodeData(fstream &dStream) throw(fstream::failure);
string GetKey() throw();
inline void SetDirty() throw();
inline void FlagReference() throw();
private:
unsigned long m_nodeId; // the unique node id
// (used in m_parentNode,
// m_equalChildNode, m_smallerChildNode,
// m_greaterChildNode)
string m_index; // the index portion from the node
mutable unsigned long m_lastAccess; // the sequential last access to the node
// (used to notify to dump to disk)
unsigned long m_parentNode; // the parent node of this node
// (or 0 if there are no parent node)
unsigned long m_equalChildNode; // the node to go to if the test
// equals m_index
unsigned long m_smallerChildNode; // the node to go to if the test is
// smaller than m_index
unsigned long m_greaterChildNode; // the node to go to if the test is
// greater than m_index
shared_auto_ptr<T> m_data; // the data held (if any) by the node
IndexStructure<T> &m_owner; // the IndexStructure context
// owning the node
bool m_dirty; // a dirty flag referred to
// prior to dumping to disk
};
定义了 `IndexStructureNode` 模板后,`IndexStructure` 可以按如下方式定义:
#define MEMORYMAP(T) std::map<unsigned long, shared_auto_ptr<IndexStructureNode<T> > >
template <class T> class IndexStructure
{
public:
friend class IndexStructureNode<T>;
friend class IndexStructureIterator<T>;
// Constructor to build an IndexStructure from memory
IndexStructure(unsigned int nodeDataLength, unsigned int nodeIndexLength = 6,
unsigned int maxNodesInMemory = 2000) throw(FileErrorException,
fstream::failure);
// IndexStructure destructor
~IndexStructure() throw();
// Accessor of IndexStructure
unsigned long GetNodesCount() const throw() { return m_nodesCreationCount; }
unsigned long GetNodesInMemoryCount() const throw()
{ return m_memoryNodes.size(); }
bool HasData(const string index) throw();
// Mutator of IndexStructure
T& operator[] (const string index) throw(AlgorithmErrorException);
// REQUIREMENTS:
// - None.
// PROMISES:
// - Upon return, less than m_maxNodesInMemory will be in memory.
void PurgeFromMemory() throw();
protected:
// REQUIREMENTS:
// - The node needs to be fully constructed and populated with the index
// (without the need to ensure
// the m_index respects the size constraint).
// PROMISES:
// - Upon return, new node(s) may get created and
// the size of m_index will respect the constraints
// of m_nodeIndexlength.
// - The return value will point to the last node created
// (the one that should contain the data).
shared_auto_ptr< IndexStructureNode<T> >
AdaptNodeIndexSize(shared_auto_ptr< IndexStructureNode<T> > &node) throw();
// REQUIREMENTS:
// - A good node id in parameter.
// PROMISES:
// - Upon return, the node will have been loaded from disk into memory.
shared_auto_ptr<IndexStructureNode<T>> LoadNode(unsigned long dNodeId)
throw(BadNodeIdErrorException, fstream::failure);
// REQUIREMENTS:
// - An opened fstream with the mode in and out and binary.
// PROMISES:
// - The header in the file will be updated accordingly.
void WriteHeader(fstream &dStream) throw(fstream::failure);
// REQUIREMENTS:
// - An opened fstream with the mode in and out and binary.
// - m_headerSize contains the right header size.
// - A call to WriteHeader on the fstream will have been previously be made.
// PROMISES:
// - The file may be resized to ensure node can be written in it.
// - The seek position will be placed at the right offset
// to start writing the node.
void SeekStreamToNodePosition(fstream &dStream, unsigned long dNodeId)
throw(fstream::failure);
// REQUIREMENTS:
// - A node that may have child nodes.
// PROMISES:
// - The child node's parent will correctly point back to the node
// passed as a parameter.
void AdaptParentsInChildNodes(shared_auto_ptr< IndexStructureNode<T> >
&dNode) throw();
private:
string m_fileName;
unsigned int m_nodeIndexLength;
unsigned int m_nodeDataLength;
unsigned int m_maxNodesInMemory;
fstream m_cacheStream;
mutable unsigned long m_nodesCreationCount;
mutable unsigned long m_accessCount;
mutable MEMORYMAP(T) m_memoryNodes;
mutable int m_headerSize;
mutable unsigned long long m_lastAccessTotal;
static const char *m_zeroBuffer;
};
需要注意的重要事项是 `m_memoryNodes`、`m_lastAccess` 和 `m_maxNodesInMemory` 的存在。`m_memoryNodes` 是未针对密集内存使用进行优化的 `std::map`,而 `m_maxNodesInMemory` 指定了任何给定时间内存中可以有多少个节点。`m_lastAccess` 保存一个按顺序分配的值,以便我们稍后确定对该 `IndexStructureNode` 的访问是多久以前发生的(我们希望将最近引用的节点保留在内存中,并在清理时丢弃较旧的节点)。对于 `IndexStructure` 缓存到磁盘而言,每个 `IndexStructureNode` 对象都必须是可序列化的,这一点很重要。
在维护每个 `IndexStructureNode` 对象中的 `m_lastAccess` 的同时,我还维护了相关 `IndexStructure` 对象中的 `m_lastAccessTotal`。这将保持总数,以便我稍后可以使用它来进行平均,并确定对单个 `IndexStructureNode` 的引用是否足够旧,可以被清除。
为了使磁盘缓存高效,如果 `IndexStructureNode` 对象在磁盘上的大小是可预测的,那将很有帮助,因为我们可以简单地通过依赖于其 ID 的计算来偏移其在文件中的位置。一旦所有这些都就绪,就会按如下方式定期调用内存清理:
template <class T> void IndexStructure<t>::PurgeFromMemory() throw()
{
if (m_memoryNodes.size() > m_maxNodesInMemory)
{
MEMORYMAP(T) swapMap;
unsigned long long newTotal = 0;
WriteHeader(m_cacheStream);
unsigned long avgAccess = (unsigned long)
(m_lastAccessTotal/m_memoryNodes.size());
avgAccess += (avgAccess/3);
{
for (MEMORYMAP(T)::iterator it = m_memoryNodes.begin();
it != m_memoryNodes.end(); it++)
{
if (it->second.get()->m_lastAccess < avgAccess)
{
it->second.get()->WriteNode
(m_cacheStream);
it->second = NULL;
}
if (it->second.get() != NULL)
{
newTotal += it->second.get()->
m_lastAccess;
swapMap[it->first] = it->second;
}
}
}
m_memoryNodes = swapMap;
m_lastAccessTotal = newTotal;
}
}
…并且需要始终调用一个特殊方法,该方法会在节点被缓存到磁盘的情况下,将其加载到内存中,实现如下:
template <class T> shared_auto_ptr<IndexStructureNode<T>>
IndexStructure<T>::LoadNode(unsigned long dNodeId)
throw(BadNodeIdErrorException, fstream::failure)
{
if (m_memoryNodes.find(dNodeId) != m_memoryNodes.end())
{
return m_memoryNodes[dNodeId];
}
else
{
if ((dNodeId > m_nodesCreationCount) || (dNodeId < 1))
{
throw(BadNodeIdErrorException("Bad node id."));
}
IndexStructureNode<T> *dNewNode =
IndexStructureNode<T>::ReadNode(this, m_cacheStream, dNodeId);
m_memoryNodes[dNodeId] = shared_auto_ptr
<IndexStructureNode<T>>(dNewNode);
return m_memoryNodes[dNodeId];
}
}
如果没有一些机制来迭代累积的数据,索引策略就不算完整。为此,创建了 `IndexStructureIterator` 模板类。
// - IndexStructureIterator (*** PUBLIC INTERFACE ***):
// In order to traverse a IndexStructure<t> tree,
// use an IndexStructureIterator<t> object.
// IndexStructureIterator<int> iterator(dIndex);
// for (iterator.Begin(); !iterator.End(); iterator++)
// {
// cout << "Data (" << iterator.GetKey() << "): " << *iterator.get() << "\n";
// }
// ... or, an IndexStructureIterator could traverse only a portion of
// an IndexStructure tree
// by calling it the following way (provided a second IndexStructureIterator it2):
// IndexStructureIterator<int> it(m_index);
// for (it.Reset(it2); !it.End(); it++)
// {
// cout << "Data (" << it.GetKey() << "): " << *it.get() << "\n";
// }
// The previous code snippet only outputs the nodes below the current node of it2.
// The GetKey method returns the primary key of the current node.
// The get method returns the T* of the current node.
// The IndexStructureIterator template allows for iterating an IndexStructure.
// An increment of an IndexStructureIterator will make it stop
// on the next IndexStructureNode that has
// some data associated to it.
template <class T> class IndexStructureIterator
{
public:
friend class IndexStructure<T>;
friend class IndexStructureNode<T>;
IndexStructureIterator(IndexStructure<T> &structure) throw();
IndexStructureIterator<T> operator++(int) throw();
T *get() const throw();
void Reset() throw();
void Reset(IndexStructureIterator<T> &it) throw();
bool End() throw();
string GetKey() throw();
IndexStructureIterator<T> Begin() throw();
private:
class CDecisionOnNode
{
public:
enum EDirection
{
goLeft = -1,
goCenter = 0,
goRight = 2,
goUp = 3
};
CDecisionOnNode(unsigned long dNodeId, EDirection dDirection):
m_nodeId(dNodeId), m_direction(dDirection) {}
unsigned long m_nodeId;
EDirection m_direction;
};
IndexStructureIterator<T> &operator++() throw() {}
void IterateToNextPopulatedNode() throw();
stack<CDecisionOnNode> m_decisions;
IndexStructure<T> &m_structure;
bool m_endReached;
shared_auto_ptr< IndexStructureNode<T> > m_curNode;
};
遍历树使用 `CDecisionOnNode` 堆栈来实现,以确保每个节点只被访问一次。通过维护 `CDecisionOnNode` 堆栈,算法已经对后续的查询(我应该去哪里)给出了答案,并且堆栈的大小限制在索引树的深度。算法如下:
template <class T> void IndexStructureIterator<T>::IterateToNextPopulatedNode() throw()
{
shared_auto_ptr< IndexStructureNode<T> > disqualifyNode = m_curNode;
if ((!m_endReached) && (!m_decisions.empty()))
{
bool skipNextTest = false;
do
{
CDecisionOnNode dDecision = m_decisions.top();
// Unexpected (may be because of multi-threading issues).
if (dDecision.m_nodeId != m_curNode->GetNodeId())
{
CDecisionOnNode otherDecision
(m_curNode->GetNodeId(), CDecisionOnNode::goLeft);
if (m_curNode.get()->get() != NULL)
{
m_decisions.push(otherDecision);
break;
}
else
{
dDecision = otherDecision;
m_decisions.push(otherDecision);
}
}
switch (dDecision.m_direction)
{
case CDecisionOnNode::goLeft:
m_decisions.top().m_direction =
CDecisionOnNode::goCenter;
if (m_curNode->m_smallerChildNode != 0)
{
m_curNode = m_structure.LoadNode
(m_curNode->m_smallerChildNode);
}
else
{
continue;
}
m_decisions.push(CDecisionOnNode
(m_curNode->GetNodeId(),
CDecisionOnNode::goLeft));
break;
case CDecisionOnNode::goCenter:
m_decisions.top().m_direction =
CDecisionOnNode::goRight;
if (m_curNode->m_equalChildNode != 0)
{
m_curNode = m_structure.LoadNode
(m_curNode->m_equalChildNode);
}
else
{
continue;
}
m_decisions.push(CDecisionOnNode
(m_curNode->GetNodeId(),
CDecisionOnNode::goLeft));
break;
case CDecisionOnNode::goRight:
m_decisions.top().m_direction =
CDecisionOnNode::goUp;
if (m_curNode->m_greaterChildNode != 0)
{
m_curNode = m_structure.LoadNode
(m_curNode->m_greaterChildNode);
}
else
{
continue;
}
m_decisions.push(CDecisionOnNode
(m_curNode->GetNodeId(),
CDecisionOnNode::goLeft));
break;
case CDecisionOnNode::goUp:
m_decisions.pop();
if ((m_curNode->m_parentNode != 0)
&& (!m_decisions.empty()))
{
m_curNode = m_structure.LoadNode
(m_curNode->m_parentNode);
disqualifyNode = m_curNode;
}
else
{
m_endReached = true;
}
break;
}
}
while ((!m_endReached) && ((m_curNode.get()->get() == NULL) ||
(disqualifyNode.get()->get() == m_curNode.get()->get()))
&& (!m_decisions.empty()));
}
if (m_curNode.get()->get() == NULL)
{
m_endReached = true;
}
}
步骤 3:将所有内容整合到语言模型计算类中
一旦这些服务到位,创建语言模型生成类(`CLMEngine`)就变得容易了。事实上,`CLMEngine` 类的实现代码量仅为 150 行(其中近 1/3 是注释),并且逻辑非常容易理解。更妙的是,它的计算永远不会受到任何内存限制问题的束缚。
double CLMEngine::CalculateOneProbability(string dKey, CNodeData dNodeData) throw()
{
if (dNodeData.m_level == 1)
{
// For 1-grams, the probability is the count of
// the event divided by the total amount of event (not unique),
// all that diluted the discounted mass. For example,
// for the node containing the word 'a', the probability would be
// how many times 'a' is found in the corpus divided by
// the total amount of words in the same corpus, then
// multiplied by the non-discounted mass.
return (double)((double)1.0-m_discount) *
(double)dNodeData.m_count/(double)m_levelEvents
[dNodeData.m_level-1].m_events;
}
else
{
// For N-grams (where N > 1), the probability is the amount of events
// divided by the amount of events in the N-1 gram starting
// with the same N-1 words. For example, for the node
// 'this is a' in the 3-grams, the probability is
// how many times that sequence
// is found in the corpus divided by how many times
// 'this is' is found in the same corpus, multiplied by the
// non-discounted mass.
return (double)((double)1.0-m_discount) *
(double)dNodeData.m_count/(double)m_index
[dKey.substr(0, dKey.length() -
dNodeData.m_lastTokenLength - 1)].m_count;
}
}
double CLMEngine::CalculateOneBackoffProbability
(IndexStructureIterator<cnodedata> &dLocationInTree) throw()
{
double dSum = 0;
IndexStructureIterator<cnodedata> it(m_index);
// We only iterate through the sub-nodes that were passed in parameter
// to this method.
for (it.Reset(dLocationInTree); !it.End(); it++)
{
// Under the node that was passed as a parameter,
// only N+1 grams are of interest to us, and from
// these, only the nodes that start with the words
// provided from the node passed in parameter.
// For example, to calculate the backoff calculation of
// 'is a', we would need to sum the probabilities
// of all 1-gram nodes that complete the 3-gram nodes
// starting with 'is a', for example 'is a test'
// and 'is a second' (hence 'test' and 'second').
// That then needs to be diluted to the discounted mass.
if ((it.get()->m_level == (dLocationInTree.get()->m_level + 1)) &&
(dLocationInTree.GetKey().compare(it.GetKey().substr(0,
it.GetKey().length() - it.get()
->m_lastTokenLength - 1)) == 0))
{
string dKey = it.GetKey().substr
(it.get()->m_firstTokenLength + 1);
dSum += CalculateOneProbability(dKey, m_index[dKey]);
}
}
return (m_discount/(floor(((double)1-dSum)*100000+0.5)/100000));
}
关注点
利用本文提供的代码,可以在 2 分钟的分析时间内,从 1.5 MB 的语料库生成语言模型。
请注意,`IndexStructure` 类可以改进为从磁盘读取和写入。在语言模型计算的上下文中,这将有助于跳过单词计数阶段,并允许更快地合并多个语言模型。时至今日,合并语言模型的研究很少,部署更少。现在我们正朝着语言模型专业化发展,可以很容易地想到一个同时是放射科医生和肿瘤科医生的医生。可以为他提供一个同时包含两个语料库的语言模型,并且对于这些计算,如果计数文件已经保存在磁盘上,那么合并过程就仅仅是生成一个求和索引树,然后执行语言模型计算本身(从而节省了再次计数单词的大量时间)。
话虽如此,对于那些不像我一样热爱语音识别的人来说,我希望本文提出的内存管理技术能为多年来给了我很多帮助的社区中的一些人带来益处。
历史
- 2008 年 12 月 10 日星期三:文章创建