多态容器模板性能 STL C++11 字符串算法设计/图形中级 C++

高性能静态缓冲字符串

Shvetsov Evgeniy

4.66/5 (20投票s)

2015 年 12 月 25 日

MIT

9分钟阅读

58521

267

具有自动管理内存池和性能调整以及支持模块的字符串类

引言

此类的创建旨在取代 POD C 字符串（CS）的概念，该概念显然设计不佳（我个人认为这是最糟糕的解决方案之一，与空指针并列）。与 CS 相比，StaticallyBufferedString 显著提升了速度、安全性和功能性。

背景

随着 C++ 的引入，C 风格事物有了替代品：指针 -> 引用，POD C 字符串 -> std::string，malloc/free -> new/delete，结构体 -> 类，宏 -> 模板，#define MAX 127 -> static const int MAX = 127 等。

std::string 类是一个良好、可接受的解决方案，除了一点：内存池模型。它默认是动态的。尽管它受益于实时大小控制的可能性，但使用堆内存的开销有时过高。无论是锁（如果我们没有以某种方式使用私有堆）还是碎片都会导致此问题。通过预留内存来优化内存分配是提升整体性能的好方法，但它仍然涉及所有基于堆的分配问题。当然，您可以提供自己的分配器供 std::string 类使用，但编写自己的分配器并非易事，也不是我最初的重点。

使用具有固定容量内存池的容器的另一个原因是（特别是对于字符串）在大多数情况下，对象的平均和最大大小通常是众所周知的（由于预定义的限制），或者至少不难预测（通常通过数据统计）。

当使用内存池时，还有对内存分配失败的数学保证（假设您正确计算了所需的最小池大小）。

类描述

类方法通常旨在提供 std::string 类似的接口，但有一些值得注意的地方和一些需要描述的特定方法。

类模板旨在与不同类型的符号配合使用

// 'TElemType' - char / wchar_t / char16_t / char32_t (provides POD C str. if 'TElemType' is a char)
template<typename TElemType, const size_t MaxLen>

（尽管我只使用基本字符进行了精确测试，因为我个人并不真正需要其他字符类型，而且其他开发人员也很少使用）。

主要类特性是

// NOT thread safe, NO leaks, NO-throw guarantee

该类提供了一个自动生成的哈希函数对象

ADD_HASHER_FOR(TThisType, hasher)

来自 HashUtils 模块

#define ADD_HASHER_FOR(ClassName, HasherName) \
template <typename T = ClassName>\
struct HasherName {\
  auto operator()(const T& obj) const throw() -> decltype(obj.hash()) {\
    return obj.hash();\
  }\
};

以支持哈希容器。

为了提供与 STL 标准的兼容性，类将模拟对象类型定义为内部分配器类型（因为此处不涉及任何分配器）

typedef AllocatorInterface<TElemType> allocator_type; // mock object type

AllocatorInterface 模块非常简单且具有自描述性

ifndef AllocatorInterfaceH
#define AllocatorInterfaceH

#include <memory>

// C++ allocator concept: https://cppreference.cn/w/cpp/concept/Allocator
// Fake allocator, which can NOT AND would NOT work!
// Derived classes SHOULD specify the following properties:
//  'propagate_on_container_copy_assignment', 'propagate_on_container_move_assignment',
//   'propagate_on_container_swap' AND 'is_always_equal' (if req.)
template <class T>
class AllocatorInterface : public std::allocator<T> { // better derive from the 'allocator_traits'??
  
public:
  //// MS VS does NOT need this aliases, but GCC does
  typedef typename std::allocator<T> TBaseAllocatorType;
  typedef typename TBaseAllocatorType::pointer pointer;
  typedef typename TBaseAllocatorType::const_pointer const_pointer;
  typedef typename TBaseAllocatorType::reference reference;
  typedef typename TBaseAllocatorType::const_reference const_reference;
  typedef typename TBaseAllocatorType::size_type size_type;

  //// Three constructor versions must be defined (even if they do nothing) 
  ////  to allow for copy-constructions from allocator objects of other types

  AllocatorInterface() = default;

  AllocatorInterface(const AllocatorInterface&) = default;

  template <class U>
  AllocatorInterface(AllocatorInterface<U>&&) throw() {}

  // [!] NO member of the standard default allocator class template shall introduce data races,
  //  calls to member functions that allocate or deallocate storage shall occur in a single total order 
  //   and each such deallocation shall happen before the next allocation (if any) in this order [!]

  virtual ~AllocatorInterface() = default;
  
  virtual pointer address(reference ref) const throw() {
    return std::allocator<T>::address(ref);
  }

  virtual const_pointer address(const_reference ref) const throw() {
    return std::allocator<T>::address(ref);
  }

  //// Pure virtual
  virtual pointer allocate(size_type count, std::allocator<void>::const_pointer hint = nullptr) = 0;
  virtual void deallocate(pointer addr, size_type count) = 0;

  virtual size_type max_size() const throw() {
    return size_type();
  }

  //// 'construct ' AND 'destroy' are both template funcs.
  ////  so can NOT be virtual (they will be used from a 'std::allocator')

  //// Custom allocators may contain state:
  ////  each container or another allocator-aware object stores an instance of the supplied allocator
  ////   and controls allocator replacement through 'std::allocator_traits'

private:
};

template <typename T, class U>
bool operator==(const AllocatorInterface<T>&, const AllocatorInterface<U>&) throw() {
  return false;
}

template <typename T, class U>
bool operator!=(const AllocatorInterface<T>& alloc1, const AllocatorInterface<U>& alloc2) throw() {
  return !(alloc1 == alloc2);
}

#endif // AllocatorInterfaceH

字符串是 可迭代的，因为它使用通用随机访问迭代器

  typedef GenericRAIterator<StaticallyBufferedStringLight,
                            value_type, false, false> iterator;
  typedef GenericRAIterator<StaticallyBufferedStringLight,
                            value_type, true, false> reverse_iterator;
  
  typedef GenericRAIterator<StaticallyBufferedStringLight,
                            value_type, false, true> const_iterator;
  typedef GenericRAIterator<StaticallyBufferedStringLight,
                            value_type, true, true> const_reverse_iterator;

但是，非常量迭代器应非常小心地使用，因为使用它们修改内部字符序列是破坏对象状态（例如，一个简单的方法是在序列中间放置一个字符串终止符 - '\0'，从而改变实际字符串长度）和相关性（缓存哈希）的明确且简单的方法，这需要类强制恢复其状态，而这目前不支持。

我提供非 const 迭代器是为了增加类的整体灵活性和通用性，尽管我假设除非用户清楚地理解他在做什么，否则不应使用它们。

（有一种方法可以通过下标运算符提供一个代理对象来代替真实的字符引用来修复这种行为，但我不确定这是否真的需要）。

类的大多数函数都已模板化，以提供与其他可能的存储类型（如果它们满足要求）的兼容性。

  // Compatible with the ANY storage class which has 'TElemType' as an elem. type 
  //  AND provides the null-terminated str. by the public const member 'c_str'
  //   (like std::string, StaticallyBufferedStringLight<ANY SIZE> etc)
  template<typename TStorageType>
  StaticallyBufferedStringLight& operator=(const TStorageType& str) throw() {

使用 MemUtils 模块中的自定义 strCmp 函数来比较多种（和不同）类型的字符字符串

template<const size_t CmpChunkSize = sizeof(std::uintptr_t), // in bytes
         const bool SignedCmpChunk = false, // as 'strcmp' by default [7.21.4/1 (C99)]
         typename TElem1Type, typename TElem2Type>
typename IntegralTypeBySize<CmpChunkSize, true>::Type
// Used to compare strings with the diff. char types
// Works almost as fast as a 'strcmp' (at least for 'char' AND on release)
// [!] Does NOT checks an incoming args. on validity [!]
  strCmp(const TElem1Type* mem1, const TElem2Type* mem2) throw()
{
  typedef typename IntegralTypeBySize<sizeof(*mem1), false>::Type TUE1T; // unsigned both
  typedef typename IntegralTypeBySize<sizeof(*mem2), false>::Type TUE2T;
  typedef typename IntegralTypeBySize<CmpChunkSize, true>::Type TReturnType;
  // See http://stackoverflow.com/questions/1356741/strcmp-and-signed-unsigned-chars
  static_assert(sizeof(TReturnType) > sizeof(*mem1) &&
                sizeof(TReturnType) > sizeof(*mem2), "Incorrect elem. type size");

  if (mem1 == mem2) return TReturnType(); // optimization
  typename IntegralTypeBySize<CmpChunkSize, SignedCmpChunk>::Type elem1, elem2;
  while (true) {
    elem1 = static_cast<decltype(elem1)>(static_cast<TUE1T>(*mem1));
    elem2 = static_cast<decltype(elem2)>(static_cast<TUE2T>(*mem2));
    
    if (!elem1 || elem1 != elem2)
      return static_cast<TReturnType>(elem1) - elem2;
    ++mem1, ++mem2;
  }
}

还定义了流式运算符以将数字连接到字符串

  // Same as the 'append(const TValueType value)'
  template<typename TValueType,
           class = typename // remove from the overload resolution to avoid an ambiguity
             std::enable_if<std::is_arithmetic<TValueType>::value &&
                            !std::is_pointer<TValueType>::value && // 'std::nullptr_t' is OK
                            !std::is_same<TValueType, TElemType>::value>::type>
  StaticallyBufferedStringLight& operator<<(const TValueType value) throw() {
    append(value);
    return *this;
  }
  
  StaticallyBufferedStringLight& operator<<(const TElemType* const str) throw() {
    append(str);
    return *this;
  }

正如你所看到的，SFINAE 通过 std::enable_if 和其他类型特征工具在此处用于确保这些运算符仅与正确类型一起使用。

标准命名空间也进行了增强，以提高 STL 兼容性

namespace std {

  template<typename TElemType, const size_t MaxLen>
  bool operator==(const std::string& dStr,
                  const StaticallyBufferedStringLight<TElemType, MaxLen>& sStr) throw() {
    return sStr == dStr;
  }

  template<typename TElemType, const size_t MaxLen>
  bool operator==(const TElemType* const cStr,
                  const StaticallyBufferedStringLight<TElemType, MaxLen>& sStr) throw() {
    return sStr == cStr;
  }
  
  template<typename TElemType, const size_t MaxLen>
  std::ostream& operator<<(std::ostream& stream,
                           const StaticallyBufferedStringLight<TElemType, MaxLen>& str) {
    stream << str.c_str(); // OPTIMIZATION: reserve internal buffer first?
    return stream;
  }
  
  template<typename TElemType, const size_t MaxLen>
  std::istream& operator>>(std::istream& stream,
                           StaticallyBufferedStringLight<TElemType, MaxLen>& str) {
    auto symb = stream.get(); // skip first (it is '\n')
    while (true) {
      symb = stream.get(); // on EOF - 'failbit' flag is set
      switch (symb) {
        case '\n': case '\r': return stream; // escape char.
      }
      if (!stream) break; // true if either 'failbit' or 'badbit' flag is set
      if (!str.push_back(symb)) break;
    }
    return stream;
  }
};

由于字符串的容量非常受限，因此提供了一个特定方法来确定上次字符串更改操作是否导致实际数据截断（此方法仅返回一个内部标志，指示截断事件）

  // Constrcation, assertion OR appending strs can cause data truncation 
  //  due to the strictly limited size of the internal buffer
  //   in case of the incoming data truncation occures appropriated flag will be set 
  //    AND this func. will return true
  //     data truncation indication flag CAN be reseted by some operations
  bool truncated() const throw() {
    return truncated_;
  }

性能优化

内存池大小自动对齐到 64 位字大小（因为大多数现代处理器都是 x64）

  // Based on the the 'MaxLen', auto adjusted, to have a 8 byte alignment
  static const auto BUF_SIZE =
    MaxLen + 1U + (((MaxLen + 1U) % 8U) ? (8U - (MaxLen + 1U) % 8U) : 0U);

（尽管将其对齐到自然指针大小 - sizeof(std::uintptr_t) 可能会更好）

这里的优化方法之一是提供特定情况的特定处理，例如，用于将类实例与字符数组（包含 C 字符串）进行比较的比较运算符可以受益于已知数组大小

  // 'str' SHOULD contain the POD C str.
  // [!] More efficient then 'operator==(POD C str.)' (coze providing array size),
  //  BUT less effective, then 'operator==(const TStorageType& str)' [!]
  template<const size_t ArrayElemCount>
  bool operator==(const TElemType (&str)[ArrayElemCount]) const throw() {
    if (str == data_) return true;
    // Probably points to the substr. of the str. contatining in 'data_'
    if (checkAddrIfInternal(str)) return false;
    
    if (ArrayElemCount < (length_ + 1U)) return false; // optimization
    
    const auto thisStrMemChunkSize = sizeof(*data_) * (length_ + 1U); // in bytes
    return isEqualMemD<>(data_, str, thisStrMemChunkSize);
  }

（但是，看起来它并没有按预期工作，因为，至少在我的 MS VS 2013 Community update 5 中，它根本没有被调用）。

这里的比较由 MemUtils 模块中的自定义 isEqualMemD 函数执行

// As 'memcmp' by default (http://www.cplusplus.com/reference/cstring/memcmp/)
template<const bool SignedCmpChunk = false>
// 'memSize' is in bytes
// D - Dispatcher
//  (uses either 64 OR 32 bit chunks, depending on the CPU type, to better fit the register size)
// [!] Faster then 'memCmp<>' OR 'memcmp', use instead of 'isEqualMem' [!]
bool isEqualMemD(const void* const mem1, const void* const mem2, const size_t memSize) throw() {
  typedef const typename IntegralTypeBySize<8U, SignedCmpChunk>::Type T64Bit;
  typedef const typename IntegralTypeBySize<4U, SignedCmpChunk>::Type T32Bit;
  typedef const typename IntegralTypeBySize<1U, SignedCmpChunk>::Type T8Bit;
  
  if (memSize < 8U) {
    return !memcmp(mem1, mem2, memSize);
  }
  switch (CPUInfo::INSTANCE.is64BitCPU) {
    case true: // 64 bit
      assert(memSize >= 8U);
      if (!isEqualMem<8U, SignedCmpChunk>(mem1, mem2, memSize / 8U)) return false;
      // Check the remain (NOT fitted) bytes; sizeof(char) == 1
      return *(reinterpret_cast<T64Bit*>(static_cast<T8Bit*>(mem1) + memSize - 8U)) ==
             *(reinterpret_cast<T64Bit*>(static_cast<T8Bit*>(mem2) + memSize - 8U));
      
    default: // 32 bit
      assert(memSize >= 4U);
      if (!isEqualMem<4U, SignedCmpChunk>(mem1, mem2, memSize / 4U)) return false;
      return *(reinterpret_cast<T32Bit*>(static_cast<T8Bit*>(mem1) + memSize - 4U)) ==
             *(reinterpret_cast<T32Bit*>(static_cast<T8Bit*>(mem2) + memSize - 4U));
  }
}

此函数只是一个调度器，它根据 CPU 类型调用实际的比较函数（以受益于 CPU 寄存器大小）

// [!] Does NOT checks an incoming args. on validity [!]
// Use this when the fact itself of a difference of the two memory chunks is meaningfull,
//  NOT the actual difference value
// [!] Works a bit faster then 'memcmp' [!]
bool isEqualMem(const void* const mem1, const void* const mem2, const size_t iterCount) throw() {
  typedef typename IntegralTypeBySize<CmpChunkSize, SignedCmpChunk>::Type TCmpChunkType;

  if (mem1 == mem2) return true; // optimization
  auto mem1re = static_cast<const TCmpChunkType*>(mem1); // reinterpreted
  auto mem2re = static_cast<const TCmpChunkType*>(mem2);
  
  const auto end = mem1re + iterCount;
  while (mem1re < end) {
    if (*mem1re != *mem2re) return false;
    ++mem1re, ++mem2re;
  }
  return true;
}

当前 CPU 是 x32 还是 x64 的检测由 HardwareUtils 模块执行

#ifndef HardwareUtilsH
#define HardwareUtilsH

#ifdef _MSC_VER
  #include <cstring>
  #include <intrin.h> // Microsoft Specific
#elif __GNUC__
  #include <cpuid.h> // GCC
#endif

class CPUInfo {

public:
  // Static singleton (static init. is a thread safe in C++11)
  static const CPUInfo INSTANCE;

  const bool is64BitCPU = false;

private:

  CPUInfo() throw()
    : is64BitCPU(findIs64BitCPU())
  {}

  ~CPUInfo() = default;

  CPUInfo(const CPUInfo&) throw() = delete;
  CPUInfo(CPUInfo&&) throw() = delete;
  CPUInfo& operator=(const CPUInfo&) throw() = delete;
  CPUInfo& operator=(CPUInfo&&) throw() = delete;

  /* Reference:
  http://stackoverflow.com/questions/12212385/detect-if-the-processor-is-64-bit-under-32-bit-os
  https://en.wikipedia.org/wiki/CPUID
  https://en.wikipedia.org/wiki/Long_mode
  https://msdn.microsoft.com/en-us/library/hskdteyh(v=vs.120).aspx
  https://msdn.microsoft.com/en-us/library/windows/desktop/ms684139(v=vs.85).aspx
  http://stackoverflow.com/questions/14266772/how-do-i-call-cpuid-in-linux
  */
  // HINT: Better rewrite this using an assembly
  //  (see examples: https://en.wikipedia.org/wiki/CPUID#CPUID_usage_from_high-level_languages)
  static bool findIs64BitCPU() throw() {
    static const auto GET_MAX_CMD_SUPPORTED = 0x80000000U; // 2 147 483 648
    
    unsigned int cpuInfo[4U] = {0}; // from EAX, EBX, ECX, and EDX
    #ifdef _MSC_VER
      auto const cpuInfoRe = reinterpret_cast<int*>(cpuInfo); // reinterpreted
      __cpuid(cpuInfoRe, GET_MAX_CMD_SUPPORTED);
      // Max. value of 'function_id' supported for extended functions
      const auto maxFuncIDSupported = *cpuInfo; // at 'EAX'
    #elif __GNUC__
      // 'ext' SHOULD be 0x8000000 to return highest supported value for extended 'cpuid' information
      // Returns 0 if 'cpuid' is not supported or whatever 'cpuid' returns in EAX register
      const auto maxFuncIDSupported = __get_cpuid_max(GET_MAX_CMD_SUPPORTED, nullptr);
    #else
      static_assert(false, "Unsupported complier"); // __BORLANDC__, __MINGW32__
    #endif
    
    // Get Extended Processor Info and Feature Bits
    static const auto GET_EXTENDED_INFO  = 0x80000001U; // 2 147 483 649
    // If does NOT supports extended flags
    if (maxFuncIDSupported < GET_EXTENDED_INFO) return false;
    #ifdef _MSC_VER
      memset(cpuInfo, 0, sizeof(cpuInfo));
      __cpuid(cpuInfoRe, GET_EXTENDED_INFO);
    #elif __GNUC__
      // Checks if 'cpuid' is supported and returns 1 for valid cpuid information or
      //  0 for unsupported cpuid level
      if (!__get_cpuid(GET_EXTENDED_INFO, cpuInfo, cpuInfo + 1U, cpuInfo + 2U, cpuInfo + 3U))
        return false;
    #endif
    
    //' LM' (Long Mode) flag for AMD / 'EM64T' for Intel
    static const auto LONG_MODE_BIT = 536870912U; // 2 pow 29: 29-th bit
    // Check if bit is signaled
    return 0U != (LONG_MODE_BIT & cpuInfo[3U]); // from EDX (cpuInfo[3U])
  }

};

#endif // HardwareUtilsH

正如我之前提到的，字符串支持哈希码生成（使用 HashUtils 模块中低冲突的 getFNV1aHash 函数），这允许将字符串实例存储在高效的搜索数据结构中，并且项目查找的平摊时间为常量。但该类还支持智能哈希缓存 - 它存储已知的哈希码值并返回它（如果仍然相关），而不是重新计算它

  //// [!] Hashing algo. SHOULD never be changed at the runtime (must be deterministic) [!]

  // Uses FNV-1a algorithm (ONLY for the one byte chars - char / unsigned char!!)
  size_t hash() const throw() {
    static_assert(1U == sizeof(TElemType), "'TElemType' SHOULD be 1 byte long!");

    if (modified_) { // recalc. rather then returning cached value
      hash_ = MathUtils::getFNV1aHash(data_);
      modified_ = false;
    }
    return hash_;
  }

比较运算符也受益于哈希缓存：如果比较两个具有某些兼容类型的存储实例，并且允许哈希码计算和缓存，假设哈希码已经计算（使用相同的哈希算法）并且对两个实例仍然相关 - 如果哈希码不同，则实例不相等（从而使比较复杂度为常数而不是线性）

  // Compatible with the ANY storage class which has 'TElemType' as an elem. type 
  //  AND provides the null-terminated str. by the public const member 'c_str'
  //   AND returning actual str. len. with the public const member 'length'
  //    (like std::string, StaticallyBufferedStringLight<ANY SIZE> etc)
  // [!] The most effective type of comparing strs is 
  //      comparing a one 'StaticallyBufferedStringLight' to another (coze providing hash) [!]
  template<class TStorageType>
  bool operator==(const TStorageType& str) const throw() {
    if (str.length() != length()) return false;
    
    // Using hash code. algo. ID to identify if the same algo. is used by the 'str' instance
    // OPTIMIZATION HINT: use C++14 'constexpr' here
    static const auto SAME_HASHING_ = (hashAlgoID() == hashAlgoID::ExecIfPresent(str));
    if (SAME_HASHING_) {
      const auto thisHash = getHashIfKnown(); // ONLY if already calculated
      if (thisHash) { // if this hash is relevant
        // If 'TStorageType' provides hash code 
        //  AND hash code is ALREADY calculated (cached) AND relevant
        //   for the BOTH compared objects - if codes does NOT equal -> 
        //    objects is clearly diffirent (return false)
        // Called ONLY if exists (if NOT - called surrogate which is ALWAYS return zero)
        const auto otherHash = getHashIfKnown::ExecIfPresent(str);
        // REMEMBER that hash code equivalence does NOT actually means that object are equal
        //  due to the non-nill collison probabilty
        if (otherHash && otherHash != thisHash) return false; // if other hash is known AND relevant
      }
    }
    static const auto SAME_CHAR_TYPE_ = // remove cv and ref.
      std::is_same<typename std::decay<decltype(*c_str())>::type,
                   typename std::decay<decltype(*str.c_str())>::type>::value;
    switch (SAME_CHAR_TYPE_) {
      case true: return isEqualMemD<>(data_, str.c_str(), sizeof(*data_) * length());
      // Diff. types: call 'operator==(const TOtherElemType* const str)'
      default: return *this == str.c_str();
    }
  }

这里使用 hashAlgoID 和 getHashIfKnown 辅助函数，它们非常不言自明

  // Returns unique ID of the hashing algo. used to calculate
  //  a value returned by the 'hash' AND 'getHashIfKnown' routines
  // [!] SHOULD never return zero [!]
  // Define AND use this if you planning to compare this instance with the other instances
  //  which using the same hash code calculation algorithm [OPTIMIZATION]
  // OPTIMIZATION HINT : use C++14 'constexpr' here
  static size_t hashAlgoID() throw() {
    static const size_t ID_ = 'F' + 'N' + 'V' + '-' + '1' + 'a';
    return ID_;
  }

  // NEVER recalculates hash
  // Returns zero if actual hash is unknown OR if str. is empty
  size_t getHashIfKnown() const throw() {
    return modified_ ? size_t() : hash_;
  }

它们都实际上是使用 ExecIfPresent 惯用法调用的

EXEC_MEMBER_FUNC_IF_PRESENT(getHashIfKnown, size_t())
EXEC_MEMBER_FUNC_IF_PRESENT(hashAlgoID, size_t())

因此，不支持此机制的存储类型（例如标准 std::string）仍然可以与此类别一起使用，从而使其更通用和灵活。

等价比较运算符也可以受益于哈希码缓存，如果（并且仅当）证明哈希码生成算法以这样的方式生成哈希值：较大的（就比较而言）字符串的哈希值也总是更大

  // [!] The most effective type of comparing strs is 
  //      comparing a one 'StaticallyBufferedStringLight' to another (coze providing hash) [!]
  // 'TStorageType' SHOULD provide POD C str. with it's 'data' function
  template<class TStorageType>
  bool operator<(const TStorageType& str) const throw() {
    // Using hash code. algo. ID to identify if the same algo. is used by the 'str' instance
    // OPTIMIZATION HINT: use C++14 'constexpr' here
    static const auto SAME_HASHING_ = (hashAlgoID() == hashAlgoID::ExecIfPresent(str));
    // If proved that the hash code of a larger str. is larger - 
    //  we can just check the hash code here [OPTIMIZATION]
    //   (ONLY if the hash code algo. is ALWAYS generates a lager hash for a larger str., FNV-1a is NOT)
    if (SAME_HASHING_ && HashCodeChecker::INSTANCE.hashOfLargerStrLarger) {
      // Get hash ONLY if already known [OPTIMIZATION]
      const auto thisHashCached = getHashIfKnown(), otherHashCached = str.getHashIfKnown();
      if (thisHashCached && otherHashCached && (thisHashCached != otherHashCached)) {
        // Equal caches does NOT necessary means the strs is actually equal, 
        //  due to collison probability
        return thisHashCached < otherHashCached;
      }
    }
    static const auto SAME_CHAR_TYPE_ = // remove cv and ref.
      std::is_same<typename std::decay<decltype(*c_str())>::type,
                   typename std::decay<decltype(*str.c_str())>::type>::value;
    switch (SAME_CHAR_TYPE_) {
      case true:
        if (static_cast<const void*>(data_) == static_cast<const void*>(str.data())) return false;
        return 0 > memcmp(data(), str.data(),
                          sizeof(*data_) * (std::min<>(length(), str.length()) + 1U));
      default:
        return *this < str.data(); // diff. types: call 'operator<(const TOtherElemType* const str)'
    }
  }

然而，FNV-1a 算法不满足此要求，这已由自动化测试模块 HashCodeChecker 证明

#include "StaticallyBufferedStringLight.h"

struct HashCodeChecker::HashChecker {
  size_t operator()(const std::string& str) const throw() {
    if (!COLLECT_STATS && !NEXT_HASH_LARGER_OR_EQ) return 0U; // skip [OPTIMIZATION]
    STR = str;

    const auto currHash = STR.hash();
    const auto nextLargerOrEq = (currHash >= PREV_HASH);
    nextLargerOrEq ? ++STATS[1U] : ++*STATS;
    NEXT_HASH_LARGER_OR_EQ &= nextLargerOrEq;
    PREV_HASH = currHash;

    return currHash;
  }

  static bool COLLECT_STATS; // will be initially false as a static
  // First - next lower count (NOT OK!), second - next larger count (OK); both zero at the start
  static size_t STATS[2U]; // first: NOT larger, second: next larger count
  static size_t PREV_HASH; // will be initially zero as a static
  static bool NEXT_HASH_LARGER_OR_EQ; // initially true!!
  static StaticallyBufferedStringLight<char, 7U> STR;
};

bool HashCodeChecker::HashChecker::COLLECT_STATS;
size_t HashCodeChecker::HashChecker::STATS[2U];
size_t HashCodeChecker::HashChecker::PREV_HASH;
bool HashCodeChecker::HashChecker::NEXT_HASH_LARGER_OR_EQ = true;
decltype(HashCodeChecker::HashChecker::STR) HashCodeChecker::HashChecker::STR;

template <const bool SkipStatistics>
bool HashCodeChecker::ishashOfLargerStrLarger() throw() {
  //// Test A
  static const auto MAX_LEN_ = 55U;
  static_assert('A' + MAX_LEN_ < 128, "Incorrect max. len.");
  StaticallyBufferedStringLight<char, MAX_LEN_> str;

  decltype(str.hash()) prevHash = str.hash(), nowHash = 0U;
  auto currChar = 'A'; // from 'A' to 'A'+MAX_LEN_
  auto nextHashLargerOrEqCount = 0U, nextHashLowerCount = 0U;

  for (auto charIdx = 0U; charIdx < str.max_size(); ++charIdx) {
    str += currChar;
    ++currChar;
    nowHash = str.hash();

    nowHash >= prevHash ? ++nextHashLargerOrEqCount : ++nextHashLowerCount;
    if (SkipStatistics && nextHashLowerCount) return false;
    prevHash = nowHash;
  }
  //// Test B
  char strBuf[4U] = {0};
  HashTester::Params<std::extent<decltype(strBuf)>::value> params(strBuf);
  const auto result = HashTester::createAndTestCharSeq<HashChecker>(params, false);

  return !nextHashLowerCount && result && HashChecker::NEXT_HASH_LARGER_OR_EQ;
}

#ifdef _MSC_VER
  // GCC might NOT build with this, while MS VS 2013 will NOT build withOUT this
  const HashCodeChecker HashCodeChecker::INSTANCE;
#endif

（此模块使用 HashTester 模块）。

还有一个特定的 append 方法，用于执行优化后的（因为它跳过了双缓冲复制 / 临时对象创建和销毁 / 堆内存使用，例如，与标准 std::to_string 不同）连接数字的字符串表示

  // Adds a WHOLE number in a str. representation
  //  (be carefull here, do NOT miss with the 'operator+=(const TElemType symb)')
  template<typename TValueType,
           class = typename // remove from the overload resolution to avoid an ambiguity
             std::enable_if<std::is_arithmetic<TValueType>::value &&
                            !std::is_pointer<TValueType>::value && // 'std::nullptr_t' is OK
                            !std::is_same<TValueType, TElemType>::value>::type>
  StaticallyBufferedStringLight& append(const TValueType value) throw() {
    const auto spaceLeft = std::extent<decltype(data_)>::value - length_;
    if (spaceLeft < 2U) { // NO actual space left (1 for the str. terminator)
      truncated_ = true;
      return *this;
    }
    const char* mask = "";
    auto returnFlag = false; // true if return immediately
    auto getMask = [&]() throw() {
      typedef typename TypeTag<decltype(value)>::TypeTags_ Tags;
      switch (TYPE_TAG(value)) {
        // OPTIMIZATION thoughts: reduce the mask count?
        //  use 'if std::is_floatinng<decltype(value)>' for two mask types - float|integral?

        default: assert(false); // unidentified type
        case Tags::NULLPTR: returnFlag = true; break;

        case Tags::BOOL: mask = value ? "true" : "false"; break;
        
        case Tags::SIGNED_CHAR: mask = "%hhi"; break;
        case Tags::SIGNED_SHORT_INT: mask = "%hi"; break;
        case Tags::SIGNED_INT:
        case Tags::WCHAR: // signedness of wchar_t is unspecified
          mask = "%i"; break;
        case Tags::SIGNED_LONG_INT: mask = "%li"; break;
        case Tags::SIGNED_LONG_LONG_INT: mask = "%lli"; break;

        case Tags::UNSIGNED_CHAR: mask = "%hhu"; break;
        case Tags::UNSIGNED_SHORT_INT: mask = "%hu"; break;
        case Tags::UNSIGNED_INT: mask = "%u"; break;
        case Tags::UNSIGNED_LONG_INT: mask = "%lu"; break;
        case Tags::UNSIGNED_LONG_LONG_INT: mask = "%llu"; break;

        case Tags::FLOAT:
        case Tags::DOUBLE:
          mask = "%f"; break;
        case Tags::LONG_DOUBLE: mask = "%Lf";
      }
    };
    getMask();
    if (returnFlag) return *this;
    
    #ifdef _MSC_VER // MS VS specific
      // Number of characters stored in buffer, not counting the terminating null character
      auto count = _snprintf_s(data_ + length_, spaceLeft, _TRUNCATE, mask, value);
      if (count < 0) { // if NOT enough space left
        count = spaceLeft - 1U;
        truncated_ = true;
      }
    #else
      // Number of characters that properly SHOULD have been written (except terminating null character)
      auto count = snprintf(data_ + length_, spaceLeft, mask, value);
      if (count < 0) {
        data_[length_] = '\0'; // ensure NO actual changes
        return *this; // encoding error
      }
      if ((count + 1U) > spaceLeft) { // +1 for the str. terminator
        count = spaceLeft - 1U;
        truncated_ = true;
      }
    #endif

    length_ += count;
    modified_ = true;
    return *this;
  }

速度比较

现在是真正的考验!

（所有测试均在发布模式下进行，所有优化均开启，在 MS VS Community 2013 update 5 中构建 x32，在我运行 x64 Win 8.1 的联想 IdeaPad S400 x64 笔记本电脑上运行）

1) 连接速度测试

此测试对字符串实例执行多次（约 400,000 次）最小（单个字符）的连接操作。对于 std::string 来说是一个好情况（因为只有一小部分连接会导致长度大于当前容量的情况，从而需要进行重新分配，并且这个百分比可能会随着时间的推移而减少，具体取决于实际的字符串实现），对于 C 字符串来说是最差情况（因为它需要在实际添加之前确定 [以线性复杂度] 字符串长度，这导致每次调用时间复杂度以算术级数增长），对于 静态字符串 来说是不关心的情况。

std::string:

- 最佳：在连接之前预分配所有所需空间。

- 最差：添加长度以几何级数增长的字符串，使得每次添加都将需要进行重新分配

C 字符串.:

- 最佳：一次向空字符串添加

- 最差：对长初始字符串进行多次最小的连接

静态字符串：所有情况都相同

测试代码:

    static const auto STR_BUF_SIZE_ = 440U * 1024U;
    CStr<STR_BUF_SIZE_ - 1U> cstr___1_;
    StaticallyBufferedStringLight<char, STR_BUF_SIZE_ - 1U> static_str___1_;
    std::string dynamic_str___1_;

    cstr___1_.clear();
    std::cout << cstr___1_.strBuf; // to force the compiler to generate the actual code
    const volatile auto cstr_test_time_ =
      TestConcat(cstr___1_, STR_BUF_SIZE_ - 1U).count();
    cstr___1_.clear();
    std::cout << cstr___1_.strBuf; // to force the compiler to generate the actual code

    static_str___1_.clear();
    std::cout << static_str___1_; // to force the compiler to generate the actual code
    const volatile auto static_str_test_time_ =
      TestConcat(static_str___1_, STR_BUF_SIZE_ - 1U).count();
    cstr___1_.clear();
    std::cout << cstr___1_.strBuf; // to force the compiler to generate the actual code
    
    dynamic_str___1_.clear();
    std::cout << dynamic_str___1_; // to force the compiler to generate the actual code
    const volatile auto dynamic_str_test_time_ =
      TestConcat(dynamic_str___1_, STR_BUF_SIZE_ - 1U).count();
    cstr___1_.clear();
    std::cout << cstr___1_.strBuf; // to force the compiler to generate the actual code

    std::cout << "static str: " << static_str_test_time_ << std::endl;
    std::cout << "dynamic str: " << dynamic_str_test_time_ << std::endl;
    std::cout << "cstr: " << cstr_test_time_ << std::endl;
    std::cout << "\nStatic str.: 1.0, dynamic str.: "
              << static_cast<double>(dynamic_str_test_time_) / static_str_test_time_
              << ", POD C str.: "
              << static_cast<double>(cstr_test_time_) / static_str_test_time_ << std::endl;

template <class TStrType>
auto TestConcat(TStrType& str, const size_t maxLen) throw()
  -> decltype(std::chrono::system_clock::now() - std::chrono::system_clock::now())
{
  static const auto MIN_CHAR_ = 'a';
  static const auto MAX_CHAR_ = 'z';
  static_assert(MIN_CHAR_ < MAX_CHAR_, "Invalid chars");

  std::default_random_engine gen;
  std::uniform_int_distribution<int> distr(MIN_CHAR_, MAX_CHAR_);

  char strToAdd[2U] = {0};
  
  str.clear();
  const auto startTime = std::chrono::high_resolution_clock::now();

  for (size_t charIdx = 0U; charIdx < maxLen; ++charIdx) {
    *strToAdd = distr(gen); // constant complexity
    str += strToAdd; // linear to const complexity
  }
  const auto endTime = std::chrono::high_resolution_clock::now();

  return endTime - startTime;
}

template <const size_t MaxLen>
struct CStr {

  CStr() throw() {
    clear();
  }

  void clear() throw() {
    *strBuf = '\0';
  }

  void operator+=(const char* const str) throw() {
    strcat(strBuf, str);
  }

  char strBuf[MaxLen + 1U];
};

结果:

2) 内存比较例程速度测试

此测试涉及测试标准 C memcmp / strcmp 和我自定义的 MemUtils 模块中的比较函数的速度。

测试代码:

    const auto SIZE_ = 1024U * 1024U * 8U;
    auto mem1 = new char[SIZE_];
    memset(mem1, 200, SIZE_ - 1U);
    auto mem2 = new char[SIZE_];
    memset(mem2, 200, SIZE_ - 1U);
    mem1[SIZE_ - 1U] = '\0';
    mem2[SIZE_ - 1U] = '\0';
    
    // (64 bit OS, 64 bit CPU, 32 bit application)
    decltype(std::chrono::system_clock::now()) time1, time2;
    static const auto COUNT_001_ = 7U;
    static const auto TEST_COUNT_001_ = 40U;
    decltype((time2 - time1).count()) counts[COUNT_001_] = {0};
    auto memCmp_faster_then_memcmp_count = size_t();
    auto memCmp_faster_then_quickCmp = size_t();
    unsigned long long avg[COUNT_001_] = {0};
    const auto iterCount_1_ = 40U;

    for (size_t idx = 0U; idx < iterCount_1_; ++idx) {
      time1 = std::chrono::system_clock::now();
      volatile long long int r0 = 0LL;
      for (size_t testIdx = size_t(); testIdx < TEST_COUNT_001_; ++testIdx) {
        // 'strCmp<>' with 4U works MUCH faster, then with the 8U
        r0 = strCmp<>(mem1, mem2);
      }
      time2 = std::chrono::system_clock::now();
      *counts = (time2 - time1).count();
      *avg += *counts;

      time1 = std::chrono::system_clock::now();
      volatile long long int r1 = 0LL;
      for (size_t testIdx = size_t(); testIdx < TEST_COUNT_001_; ++testIdx) {
        // 'quickCmp' looks like with 8U works faster, then with the 4U
        //  (BUT we can't use 8U AND this func. is NOT used)
        r1 = quickCmp(mem1, mem2, SIZE_);
      }
      time2 = std::chrono::system_clock::now();
      counts[1] = (time2 - time1).count();
      avg[1U] += counts[1];

      time1 = std::chrono::system_clock::now();
      volatile long long int r2 = 0LL;
      for (size_t testIdx = size_t(); testIdx < TEST_COUNT_001_; ++testIdx) {
        r2 = memcmp(mem1, mem2, SIZE_);
      }
      time2 = std::chrono::system_clock::now();
      counts[2] = (time2 - time1).count();
      avg[2U] += counts[2];

      time1 = std::chrono::system_clock::now();
      volatile long long int r3 = 0LL;
      for (size_t testIdx = size_t(); testIdx < TEST_COUNT_001_; ++testIdx) {
        r3 = strcmp(mem1, mem2);
      }
      time2 = std::chrono::system_clock::now();
      counts[3] = (time2 - time1).count();
      avg[3U] += counts[3];

      time1 = std::chrono::system_clock::now();
      volatile long long int r4 = 0LL;
      const auto count_1_ = SIZE_ / sizeof(std::uintptr_t);
      for (size_t testIdx = size_t(); testIdx < TEST_COUNT_001_; ++testIdx) {
        // 'memCmp' with 8U works faster, then with 4U (BUT we can't do that)
        r4 = memCmp<>(mem1, mem2, count_1_);
      }
      time2 = std::chrono::system_clock::now();
      counts[4] = (time2 - time1).count();
      avg[4U] += counts[4];

      time1 = std::chrono::system_clock::now();
      volatile bool r5 = false;
      for (size_t testIdx = size_t(); testIdx < TEST_COUNT_001_; ++testIdx) {
        r5 = isEqualMemD(mem1, mem2, SIZE_);
      }
      time2 = std::chrono::system_clock::now();
      counts[5] = (time2 - time1).count();
      avg[5U] += counts[5];

      time1 = std::chrono::system_clock::now();
      volatile bool r6 = false;
      const auto count_02_ = SIZE_ / 8U;
      for (size_t testIdx = size_t(); testIdx < TEST_COUNT_001_; ++testIdx) {
        r6 = isEqualMem<8U>(mem1, mem2, count_02_);
      }
      time2 = std::chrono::system_clock::now();
      counts[6] = (time2 - time1).count();
      avg[6U] += counts[6];

      std::cout << "\n";
      std::cout << "count6 - isEqualMem<8U>    : " << counts[6] << "\n";
      std::cout << "count5 - isEqualMemD       : " << counts[5] << "\n";
      std::cout << "count4 - memCmp<>          : " << counts[4] << "\n";
      std::cout << "count3 - strcmp            : " << counts[3] << "\n";
      std::cout << "count2 - memcmp            : " << counts[2] << "\n";
      std::cout << "count1 - quickCmp          : " << counts[1] << "\n";
      std::cout << "count0 - strCmp<>          : " << *counts   << "\n";
      std::cout << "\n" << static_cast<double>(counts[3]) / counts[1] << "\n";
      assert(r1 == r2);

      if (counts[4] < counts[2]) ++memCmp_faster_then_memcmp_count;
      if (counts[4] < counts[1]) ++memCmp_faster_then_quickCmp;
    }
    std::cout << "\nmemCmp_faster_then_memcmp_count: "
              << memCmp_faster_then_memcmp_count  << " / "
              << iterCount_1_ << std::endl;
    std::cout << "memCmp_faster_then_quickCmp: "
              << memCmp_faster_then_quickCmp << " / "
              << iterCount_1_ << std::endl << std::endl;

    const char* const names_001_[]
      = {"strCmp<>          ",
         "quickCmp          ",
         "memcmp            ",
         "strcmp            ",
         "memCmp<>          ",
         "isEqualMemD       ",
         "isEqualMem<8U>    "};
    auto idx_0001_ = 0U;
    for (volatile auto& currAvg : avg) {
      currAvg /= iterCount_1_;
      std::cout << "avg " << names_001_[idx_0001_] << " - #" << idx_0001_ << ": " << currAvg << std::endl;
      ++idx_0001_;
    }
    delete[] mem1, delete[] mem2;

结果:

Static str 使用最合适（并非总是最快）的函数：strCmp<>、isEqualMemD<>、memcmp。

Static str 在可能的情况下使用 isEqualMemD<> 而不是 memcmp，因为它更快；使用 strCmp<> 而不是 strcmp，因为它能够比较多种字符类型；在可能的情况下使用 memcmp 而不是 strCmp<>，因为它更快。

尽管此页面说标准 memcmp 函数在所有架构上都有内在形式，并且我使用 '/Oi' 编译器选项编译了我的测试用例，但测试显示自定义 isEqualMemD<> 稍快（这可能不应该发生）。

3) 关系比较速度测试

测试“小于”运算符的速度。

测试代码:

    const decltype(s_str_01_) s_str_02_ = static_chars_01_;
    const decltype(dstr_01_) dstr_02_ = static_chars_01_;

    volatile auto result__01_ = false;
    time1 = std::chrono::system_clock::now();
    for (size_t testIdx = size_t(); testIdx < 100000U; ++testIdx) {
      result__01_ = s_str_01_ < s_str_02_;
    }
    time2 = std::chrono::system_clock::now();
    counts[1] = (time2 - time1).count();

    time1 = std::chrono::system_clock::now();
    for (size_t testIdx = size_t(); testIdx < 100000U; ++testIdx) {
      result__01_ = dstr_01_ < dstr_02_;
    }
    time2 = std::chrono::system_clock::now();
    counts[2] = (time2 - time1).count();

    time1 = std::chrono::system_clock::now();
    for (size_t testIdx = size_t(); testIdx < 100000U; ++testIdx) {
      result__01_ = strcmp(s_str_01_.c_str(), s_str_02_.c_str()) == 0;
    }
    time2 = std::chrono::system_clock::now();
    counts[3] = (time2 - time1).count();

    volatile auto diff = static_cast<double>(counts[2]) / counts[1];
    std::cout << "\nStatic str. 'operator<' " << diff << " times faster then the dynamic one\n";
    diff = static_cast<double>(counts[3]) / counts[1];
    std::cout << "  and " << diff << " times faster then the 'strcmp'\n";

结果:

4) 大规模分配/释放速度测试

展示了该概念为何诞生 :)

首先，用于执行测试的 PerformanceTester 模块

#ifndef PerformanceTesterH
#define PerformanceTesterH

#include <chrono>
#include <iostream>

class PerformanceTester {

public:
  
  struct TestResults {
    typedef decltype(std::chrono::system_clock::now()) TTimePoint;
    typedef decltype((TTimePoint() - TTimePoint()).count()) TTimeCount;

    void clear() throw() {
      time1 = TTimeCount(), time2 = TTimeCount();
    }

    TTimeCount time1 = TTimeCount(), time2 = TTimeCount();
  };

  // Returns diff. (2/1)
  template <class TFunctor1, class TFunctor2, const bool onScreen = true>
  static double test(TFunctor1& subj1, TFunctor2& subj2,
                     const size_t testCount, TestResults& results)
  {
    results.clear();
    if (!testCount) return 0.0;
    
    auto time1 = TestResults::TTimePoint(), time2 = TestResults::TTimePoint();
    auto testIdx = size_t();
    
    auto testSubj = [&](const bool isSecondSubj) throw() {
      time1 = std::chrono::system_clock::now();
      for (testIdx = size_t(); testIdx < testCount; ++testIdx) {
        switch (isSecondSubj) {
          case true: subj2(); break;
          default: subj1(); // false
        }
      }
      time2 = std::chrono::system_clock::now();

      return (time2 - time1).count();
    };

    results.time1 = testSubj(false);
    results.time2 = testSubj(true);
    const auto diff = static_cast<double>(results.time2) / results.time1;
    
    if (onScreen) {
      auto const potfix = (diff < 1.0) ? "faster" : "slower";
      const auto diffReinterpreted = (diff < 1.0) ? (1.0 / diff) : diff;
      std::cout << "\nSecond is " << diffReinterpreted << " times " << potfix << std::endl;
    }
    return diff;
  }
};

#endif // PerformanceTesterH

测试代码:

struct Funct1__ {
  volatile int operator()() throw() {
    volatile StaticallyBufferedStringLight<> str;
    return 0;
  }
} funct1__;

struct Funct2__ {
   volatile int operator()() throw() {
    std::string str;
    str.reserve(127U);
    return 0;
  }
} funct2__;

PerformanceTester::TestResults results1;
PerformanceTester::test(funct1__, funct2__, 9999999U, results1);

结果:

5) 大规模相等比较速度测试

测试静态字符串哈希码已计算情况下，“不相等”运算符的速度。

测试代码（结尾处的不同符号）

static auto const STR1
  = "cam834mht8xth,a4xh387th,txh87c347837 g387go4 4w78t g34 3g7rgo bvgq7 tgq3874g478g8g oebgbg8 b"
    "cwmc8htcw,o7845mt754cm8w4gcm8w4gc78w4gcw4cw4yc4c4xn34x63gc7sch74s4h5mc7h7cn34xm7xg78gxm7384x";

static auto const STR2
  = "cam834mht8xth,a4xh387th,txh87c347837 g387go4 4w78t g34 3g7rgo bvgq7 tgq3874g478g8g oebgbg8 b"
    "cwmc8htcw,o7845mt754cm8w4gcm8w4gc78w4cw4cw4yc4c4xn34x63gc7sc_74s4h5mc7h7cn34xm7xg78gxm7384x";

struct Funct1__ {

  Funct1__() throw() : str1(STR1), str2(STR2) {
    //// Prehash
    str1.hash();
    str2.hash();
  }

  volatile int operator()() throw() {
    volatile auto result = (str1 != str2);
    return 0;
  }

  StaticallyBufferedStringLight<char, 255U> str1;
  StaticallyBufferedStringLight<char, 255U> str2;
} funct1__;

struct Funct2__ {

  Funct2__() throw() : str1(STR1), str2(STR2) {}

  volatile int operator()() throw() {
    volatile auto result = (str1 != str2);
    return 0;
  }

  std::string str1;
  std::string str2;
} funct2__;

PerformanceTester::TestResults results1;
PerformanceTester::test(funct1__, funct2__, 9999999U, results1);

结果: