优化正则表达式的工具





5.00/5 (5投票s)
用于优化正则表达式的 JavaScript 工具
引言
正则表达式 (regex) 是用于各种编程语言中模式匹配和文本处理的强大工具。它们通过指定匹配字符序列的模式,提供了一种简洁的方式来搜索、提取和修改文本。然而,尽管表达力强,正则表达式模式可能会迅速变得效率低下或过于复杂,从而导致性能问题,尤其是在大规模数据处理中。
正则表达式优化涉及系统地简化和改进正则表达式模式,以提高效率、可读性和可维护性,同时不改变其预期行为。精炼复杂的表达式可以减少由于过度回溯和冗余操作造成的开销,确保更快的执行时间和更低的资源消耗。
本文提出了一种基于对正则表达式形式化结构理解的正则表达式优化方法。通过将正则表达式分解为基本组件——例如字面量、字符类、量词、组和交替——并应用优化技术,我们可以显著地简化和精简正则表达式模式。使用抽象语法树 (AST) 等工具以编程方式表示和操作正则表达式结构,可以进行更深入的分析,为自动化优化技术铺平道路。
以下章节概述了正则表达式优化的核心概念,引入了用于表示正则表达式语法的结构化 BNF (巴科斯-诺尔范式),并描述了将低效模式转换为优化版本的方法。本文旨在为开发人员和自动化工具提供一个稳健的框架,以有效优化和管理复杂的正则表达式。
背景
这是关于分解正则表达式工具的后续文章。
目录
- 摘要
- 引言
- 正则表达式优化的核心概念
- 优化正则表达式
- 优化量词
- 利用字符类
- 优化或条件 (|)
- 利用非捕获组
- 策略性地实施锚点
- 避免贪婪量词
- 最小化回溯
- 谨慎使用先行断言和后行断言
- 先行断言
- 后行断言
- 后行断言的实用示例
- 正则表达式优化工具
- 为什么使用 AST?
- 优化示例
- 优化工具适用于哪些人?
- 进一步改进?
- 结论
- 其他有用的在线工具
- 1. Regex101
- 2. RegExr
- 3. RegexBuddy
- 4. Regex Pal
- 5. Regex Tester
- 6. Regular expression tester
- 参考
- 附录和 JavaScript 源代码下载
- 函数 ASTtoString()
- 函数 ASTtoRegex()
- 函数 parseRegex2AST()
- 函数 optimizeAST()
变更
2024年10月9日。格式化了代码部分并在内容后添加了下载按钮。
正则表达式优化的核心概念
正则表达式优化涉及将正则表达式分解为其基本元素和结构,例如字面字符、字符类、量词和组。这个过程对于必须调试或修改复杂正则表达式模式的开发人员至关重要。它阐明了正则表达式的每个部分的作用以及它如何影响匹配行为。由于正则表达式语法复杂且并非总是合乎逻辑,因此首先建立语法的形式化 BNF 描述对我很有帮助。
正则表达式的 BNF 语法
<regexp> ::= "/" <pattern> "/" [<flags>] <pattern> ::= <concatenation> ("|" <concatenation>)* <concatenation> ::= <element>* <element> ::= <group> | <character-class> | <quantified> | <anchor> | <escaped-character> | <character> <group> ::= "(" <pattern> ")" | "(?<identifier>" <pattern> ")" | "(?:" <pattern> ")" | "(?=" <pattern> ")" | "(?!" <pattern> ")" | "(?<=" <pattern> ")" | "(?<!" <pattern> ")" <character-class> ::= "[" [ "^" ] <character-class-body> "]" <character-class-body> ::= <character-range> | <character>* <character-range> ::= <character> "-" <character> <quantified> ::= <element> <quantifier> [ <lazy-modifier> ] // Separating quantifier and lazy modifier <quantifier> ::= "*" | "+" | "?" | "{" <digits> [ "," <digits> ] "}" <lazy-modifier> ::= "?" // Lazy quantifier modifier <anchor> ::= "^" | "$" | "\b" | "\B" <escaped-character> ::= "\\" <character> <character> ::= any single character except special characters | <escaped-character> <identifier> ::= <letter> (<letter> | <digit> | "_")* <flags> ::= <flag>* <flag> ::= "g" | "i" | "m" | "s" | "u" | "y"
现在更容易看出正则表达式如何分解为更简单的组件。我开发的用于分解的 JavaScript 函数在附录中列出。
主要的分解组包括
- 转义字符是那些前面带有反斜杠 \ 的字符,它会改变后面字符的通常含义。例如,\n 表示换行,\d 匹配任何数字。它们用于将特殊字符作为字面字符包含在正则表达式模式中,或表示特殊的正则表达式函数(如 \b 用于单词边界)。
- 字符集,用方括号 [] 括起来,匹配字符集中的任何一个字符。例如,[a-z] 匹配任何小写字母,而 [^a-z] 匹配任何非小写字母的字符。它们允许正则表达式通过匹配定义字符集中的字符而更灵活和简洁,从而增强字符串中的模式匹配能力。
- 量词决定了前一个元素(如字符、组或字符类)需要出现多少次才能匹配。标准量词包括 *(0 次或更多)、+(1 次或更多)和 ?(0 次或 1 次)。量词对于指定字符串中匹配前一个元素的出现次数至关重要,允许正则表达式模式匹配不同长度的文本。其核心形式是,量词是贪婪的,这意味着它们试图消耗尽可能多的输入进行匹配。其对应的是懒惰的,这意味着它将消耗最少的输入来完成匹配。懒惰通过在上述量词后添加额外的 ? 来表示。
- 花括号用作量词,用于指定模式元素必须出现的次数。它们可以定义固定的次数 {n}、范围 {n,m} 或模式必须出现的最小次数 {n,}。这种类型的量词有助于匹配模式中精确的重复次数,从而精确控制组件的出现。这在匹配特定数据格式(如日期、序列号或代码片段)时特别有用。简单的量词也可以用花括号表示为:* 为 {0,},+ 为 {1,},? 为 {0,1}。
- 交替,由竖线 | 符号表示,作用类似于布尔 OR。它匹配 | 之前或之后的模式。例如,cat|dog 匹配 cat 或 dog。交替允许匹配几个可能的部分之一,使其值得在单个正则表达式模式中包含多个选项,从而增强灵活性和范围。
- 正则表达式中的组将多个字符视为一个单元。它们用括号 () 括起来。组有两种形式:捕获组和非捕获组。
- 捕获组保存由括号内的正则表达式部分匹配的字符串部分。这允许用户从字符串中提取信息或在相同的正则表达式中重用模式的一部分(反向引用)。有两种捕获组:无名组 (...) 和命名组 (?
...)。命名组是对无名组的补充,无名组只能通过它们在正则表达式中的相对位置进行反向引用。命名组允许我们给组一个更具自解释性的名称,并通过该名称进行引用。普通括号 () 内的任何正则表达式都用于捕获。例如,(abc) 捕获序列 abc,而 (? \d{4}) 捕获四个数字作为年份。 - 有五种非捕获组。第一个是 (?:...),它匹配该组,但不保存该组匹配的文本。当您需要分组功能而不需要捕获开销时使用它们。在组前加上 ?:,例如 (?:abc) 以进行分组而不捕获。它将“abc”视为一个单元,而不记住匹配。
第二和第三个是两个先行断言组(正向先行断言和负向先行断言)。正向先行断言 (?= ... ) 匹配主表达式后面的组,但不将其包含在结果中。负向先行断言 (?! ... ) 断言指定的组不紧跟在主表达式之后。
第四和第五个是两个后行断言组(正向后行断言和负向后行断言)。正向后行断言 (?<= ... ) 断言指定的组紧跟在主表达式之前并且必须匹配,但它不消耗任何字符。负向后行断言 (?
每个组件在构建用于模式匹配和文本处理任务的复杂高效正则表达式中都至关重要。
优化正则表达式
正则表达式是模式匹配和文本处理的强大工具,但如果构造不当,它们会很快变得效率低下。优化正则表达式可以提高性能,并增强其可读性和可维护性。以下是一些优化正则表达式的基本最佳实践和技巧。首先,我们来看看正则表达式的语法。从宏观角度来看,您可以将其分解为以下列表。
符号 | 命名 |
^ | 锚点开头 |
$ | 锚点结尾 |
[…] | 字符类 |
\d,\D,\w,|W,\s,\S,\b,\B | 预定义类 |
Abc… | 字面量 |
+,*,?,{…},+?,*?,??,{…}? | 量词 |
(…),(? | 捕获组 |
(?:…),(?=…),(?!...),(?<=…),(? | 非捕获组 |
Cat|dog|bird | 交替 |
让我们从量词开始。
优化量词
量词出现在
<pattern>+ meaning one or more occurrences of the pattern <pattern>* meaning zero or more occurrences of the pattern <pattern>? meaning zero of one occurrence of the pattern.
量词也可以是括号量词。例如 {n} 或 {n,} 或 {n,m}。括号量词也可以后跟一个可选的 ? 表示惰性括号。
<pattern>{n} n occurrence of the <pattern> <pattern> {n,} n or more occurrences of the <pattern> <pattern> {n,m} from n to m occurrence of the <pattern>
通常可以通过简化重复的处理方式来提高效率。
Use + instead of {1,} to match one or more occurrences. Use * instead of {0,} for zero or more occurrences. Replace {0,1} with a ? to match zero or one occurrence. Simplify {n,n} to {n}
正如我所提到的,量词后面可以跟惰性量词符号 ?。例如:
<pattern>+?, <pattern>*?, <pattern>?? Or pattern>{…}?
在某些条件下,我们还可以使用惰性量词优化正则表达式。
如果我们遇到
<pattern>+? → <pattern>
例如: \d+? → \d
您可以根据需要将此优化扩展到其他惰性量词,例如
<pattern>*? → potentially equivalent to an empty pattern in some cases. <pattern>?? → reduces to just the pattern itself if applicable.
然而,这些优化需要根据正则表达式的上下文谨慎应用。这里有一个经典的例子,它不起作用。考虑正则表达式
/<.*?>/
这个正则表达式通常用于匹配像
.*? 惰性地匹配零个或多个字符,它匹配尽可能少的字符直到找到闭合的 >。例如: <div><span>Hello</span></div>。 匹配 <div>、<span>、</span>、</div>
这里,*? 并不简单地匹配一个空字符串;相反,它匹配 HTML 标签内部最短的字符序列(在 < 和 > 之间)。如果我们没有使用惰性量词 ?,那么它将匹配整个字符串:<div><span></span></div> 这不是你想要的。
在某些特定场景下,当惰性量词紧随模式 *?(即
- 当模式位于末尾并后跟一个锚点或另一个强制匹配的字符时。如果 *? 模式恰好位于锚点 (^, $, \b) 或强制文字之前,并且它不具贡献性。可以移除该模式。惰性量词允许匹配零个模式出现,锚点或文字将确定匹配。
示例 1: a*?$ 由于 *? 允许零匹配,$ 强制字符串的结尾。我们可以将其简化为:$ 在这种情况下,模式 a*? 可以安全地移除,因为无论是否有任何“a”字符,匹配都将始终在字符串的末尾 ($) 结束。
示例 2: \w*?\b 同样,我们可以简化;由于 \w*? 可以匹配零个字符,因此单词边界将始终得到满足。这简化为:\b 这里的 \w*? 是多余的,因为即使没有前面的单词字符,单词边界也会匹配。 - 当模式是更大的可选序列的一部分时。如果 *? 模式是已经可选的更大序列的一部分(例如,由于交替或另一个量词),则可以将其移除。
示例: (a|b*?)。可以简化为 (a|) 并简化为 a?。这里,b*? 可以安全地移除,因为它没有贡献任何与交替中的空替代方案不同的东西。 - 当模式在已经可选的组中重复时,如果 *? 应用于已用量词(如 ? 或 {0,})重复的组中,则 *? 可能冗余并可以移除。
示例 1: (a*?)? 外层的 ? 已经将该组标记为可选,因此 *? 是多余的,正则表达式简化为 (a)? 或直接 a?
示例 2: (a*)*? 其中外层的 *? 使整个组冗余,因为 a* 已经允许零匹配。这简化为 a* - 与空模式交替。如果 *? 出现在与空模式或另一个允许零匹配的模式交替 (|) 中,则可以移除 *?。
示例: (a*?|b),可以简化为 (a|b)。
当简化像 预定义的字符类可以显著简化您的表达式。 在使用交替 (|) 时,将最有可能的条件放在前面可以节省时间,特别是如果后续条件是先前条件的子集。此外,通过提取公共前缀或后缀来简化交替可以降低复杂性。例如,我们对 OR 条件执行以下优化。 转换模式,例如 当您不需要保存捕获的组以供以后使用时,非捕获组 (?:...) 更可取。这种方法减少了开销,因为正则表达式引擎不需要跟踪这些组的内容。有时,非捕获分组是不必要的,或者不会改变行为。例如,当单个模式(字面量、字符类或预定义转义)紧跟在非捕获组之后时,可以移除非捕获组。此外,如果非捕获组前面是 ^ (ANHOR_START) 或 $ (ANCHOR_END),则非捕获组是多余的。 使用锚点,例如 ^(ANCHOR_START)表示字符串的开头,$(ANCHOR_END)表示字符串的结尾,可以防止在模式已知是固定的或仅限于特定位置时,在整个字符串中进行不必要的匹配尝试。 贪婪量词 (*, +) 尝试匹配尽可能多的文本,这可能导致性能问题。选择它们的惰性对应物 (*?, +?),它们匹配最小的可能字符串,从而提高效率。然而,这取决于您要匹配的内容。 为了防止过度回溯(这会减慢您的正则表达式),请使用原子组 (?>...) 和独占量词,如 *+ 或 ++。这些构造告诉正则表达式引擎避免重新访问先前匹配的字符,从而提高复杂模式的性能。 由于 JavaScript 不支持原子组或独占量词,您通常需要依赖其他方法来防止低效的回溯。 例如,如果您试图匹配以特定后缀结尾的序列,而不是编写类似 .*suffix 的内容(这可能导致过度回溯),您可以指定后缀之前不应匹配的内容,例如 [^ ]*suffix(假设后缀前面有空格)。 先行断言和后行断言是正则表达式中进行断言的强大工具,但应谨慎使用。如果过度使用,它们可能会影响性能。仅在需要断言条件而不消耗字符时使用它们。 先行断言和后行断言是高级正则表达式特性,允许您指定匹配的额外条件,而无需将这些条件包含在匹配本身中。这些“零宽度断言”不消耗字符串上的任何字符;它们只是断言是否可能匹配。 正向先行断言 (?=) 断言字符串中当前位置紧随其后的内容必须与括号内指定的模式匹配,但不将其包含在匹配中。 示例:X(?=Y) 仅当 X 后跟 Y 时才匹配 X。在字符串 XYP 中,X 将匹配,因为它后跟 Y。 通过移除冗余结构来优化正向先行断言。 负向先行断言 (?! ) 断言字符串中当前位置紧随其后的内容不得与括号内指定的模式匹配。 示例:X(?!Y) 仅当 X 后不跟 Y 时才匹配 X。在字符串 XZP 中,X 将匹配,因为它后不跟 Y。 负向先行断言 (?!...) 有时可以更有效地重写,尤其是在应用于简单字符类或模式时。例如: 此模式匹配 b,但仅当它前面没有 a 时。根据上下文,这通常可以用否定字符类替换以简化。 正向后行断言 (?<=) 断言字符串中当前位置紧靠其前的内容必须与括号内指定的模式匹配,但不将其包含在匹配中。 示例:(?<=Y)X 仅当 Y 在 X 之前时才匹配 X。在字符串 YXP 中,X 将匹配,因为 Y 在其之前。 负向后行断言 (?<!) 断言字符串中当前位置紧靠其前的内容不得与括号内指定的模式匹配。 示例:(?
考虑一个场景,您需要查找前面没有“Refund”一词的美元金额。正则表达式可以使用负向后行断言来确保我们匹配的金额不是退款的一部分:/(?
先行断言和后行断言对于需要检查子字符串上下文而无需将该上下文包含在最终匹配结果中的条件非常强大。它们通常用于数据验证、解析复杂的字符串格式和文本行中的条件处理。 实现任何正则表达式的优化都可能具有挑战性。有些可以在不了解正则表达式目的的情况下完成。然而,大多数都基于您希望如何从正则表达式中提取信息。例如,将捕获组更改为非捕获组要求您不打算以后使用该组。优化器不可能知道这一点;因此,只能建议任何您不打算引用的组都可以更改为非捕获组。 现在,所有优化潜力都已到位,我们可以将注意力转向工具的实际编码。起初,我计划分析正则表达式以找到可以转换为更简单版本或冗余的模式。然而,我很快就遇到了这种方法的局限性。相反,我意识到要实现目标,我需要为正则表达式构建一个抽象语法树 (AST),然后使用编译器技术重新排列、简化和删除正则表达式中的冗余元素。 抽象语法树 (AST) 表示源代码或其他形式语言(例如正则表达式或数学表达式)的句法结构。AST 抽象了实际使用的语法(如括号、方括号或逗号),转而关注代码的结构和逻辑元素。树中的每个节点都代表代码中出现的一个构造。 通用编程语言 AST 的关键特征是节点、边、抽象和层次结构 示例 考虑在无数关于编译器的书籍中找到的一个简单的算术表达式 2 + 3 * 4 此表达式的 AST 将如下所示 + 纯粹主义者可能会争辩说,对于我简单的正则表达式项目来说,它更像一个解析树而不是 AST? AST 经常与解析树(或具体语法树)混淆,但它们是不同的。 如果您的结构保留了所有细节(例如区分 + 和 +?),那么我们正在处理一个解析树。如果它简化或优化了此类情况,则更准确地称为 AST。因此,我们有一个解析树,我们将其优化为 AST。为简单起见,我将其称为 AST。 对于我们的正则表达式项目,我们有像这样的节点(下面只显示了一个子集) ANCHOR_START ^ 对于我们的项目,我们需要四个不同的函数。(JavaScript 代码在附录中) 1. 将正则表达式转换为 AST。ParseRegextoAST() 2. 将其转换回正则表达式。ASTtoRegex() 3. 优化正则表达式。OptimizeAST() 4. 解码 AST 以进行调试。 ASTtoString() 所有四个不同的函数都包含递归函数,用于下降到树中并对其进行优化。 在优化正则表达式时,识别不同技能水平用户的不同需求非常重要。优化器通过自动化最佳实践和避免常见的低效率为初学者和中级用户提供了显著优势。新手通常会编写容易过度回溯或冗余的模式,而中级用户可能会忽略细微的优化,例如简化量词或删除不必要的捕获组。优化器有助于弥补这一差距,确保性能改进,减少错误,并通过演示更高效的模式作为学习工具。相比之下,高级用户通常在编写高度优化的正则表达式方面与优化器一样熟练。他们对正则表达式引擎的更深入理解使他们能够根据特定上下文进行微调优化,平衡性能与可读性和可维护性。虽然高级用户可能不总是需要自动化优化,但他们仍然可以从优化器的一致性和便利性中受益,尤其是在发现重复或容易被忽略的改进方面。 使用正则表达式 /[a-zA-Z_][0-9a-zA-Z_]*/g 优化细节如下所示。 完成这个项目后,我意识到我正在使用简单的窥孔编译器技术来优化正则表达式。缺点是每个优化步骤都必须编码。下一步是使用基于规则的优化来查看正则表达式结构的全局分析,我们在此查看可以转换为更优化版本的模式或模式序列,使用基于规则的方法。这将允许添加更多的优化技术,而无需手动编写每个新的优化。然而,这将是未来的项目,正则表达式优化工具的第2版。 所列的四个工具是优化正则表达式的关键部分,通过分解和优化,不仅有助于调试和开发,还使正则表达式达到最佳版本,并提高编写更高效、无错误的正则表达式的能力。优化工具集在这一教育过程中发挥着重要作用。作者的正则表达式工具是提到的第六个工具。它具有独特的功能,允许您同时测试两个正则表达式,分解正则表达式,或优化正则表达式。这在尝试为给定问题构建正则表达式时变得有用,然后动态地,可以构建表达式以连续匹配更多所需的模式。 有几个在线工具可用于测试和调试正则表达式。以下是一些广泛使用的正则表达式工具及其相关详细信息,您可以在文章中引用它们 1. Regex101 2. RegExr 3. RegexBuddy 4. Regex Pal 5. Regex Tester 6. Regular expression tester 本附录中的所有代码片段均基于 JavaScript,也可以下载。 利用字符类
优化 OR 条件 (|)
abc|abd → ab[cd]
abc|ab → ab[c]?
[0-9]|[a-z]|[A-Z]|[_] → \w
a|b|c → [abc]
a|b| → [ab]?
利用非捕获组
(?:a) → a
(?:a|b|c)? → [abc]?
^(?:abc) → ^abc
(?:abc)$ → abc$
(?:abc)+ → abc+
(?:[a-z])+ → [a-z]+
策略性地实施锚点
避免贪婪量词
最小化回溯
谨慎使用先行断言和后行断言
先行断言
(?=abc)abc → abc
(?:abc|def)(?=ghi) → (?:abc|def)
(?!a)b → [^a]b
后行断言
后行断言的实用示例
正则表达式优化工具
/ \
2 *
/ \
3 4
为什么使用 AST?
ANCHOR_END $
ALTERNATION |
CAPTURING GROUP ( … ), (?
NON-CAPTURING Group (?: … ), 以及其他
LITERAL. A…Z , a…z, 0… 9
QUANTIFIER. +,*,?, {…}优化工具适用于哪些人?
优化示例
您可以将其优化为 /[A-Z_a-z]\w*/Original 1st Regex:
/[a-zA-Z_][0-9a-zA-Z_]*/g
Prior Optimization:
1: ROOT: Child[2]
2: CHAR_CLASS: Value=[a-zA-Z_], Parent=1
3: QUANTIFIER: Value=*: Child[1], Parent=1
4: CHAR_CLASS: Value=[0-9a-zA-Z_], Parent=3
Suggested optimization:
[a-zA-Z_] to [A-Z_a-z] to simplify regex
Replaced [0-9a-zA-Z_] with \w to simplify regex
New Optimize regular expression:
/[A-Z_a-z]\w*/
After Optimization
1: ROOT: Child[2]
2: CHAR_CLASS: Value=[A-Z_a-z], Parent=1
3: QUANTIFIER: Value=*: Child[1], Parent=1
4: PREDEFINED_CLASS: Value=\w, Parent=3
进一步改进?
结论
其他有用的在线工具
参考
附录
函数 ASTtoString()
// Helper function to convert an AST node to a string representation
function ASTtoString(rootNode) {
if (!rootNode) return 'Invalid AST';
function ASTtoStringInternal(node, indent = 0, line = 1, parentLine = null) {
const indentation = ' '.repeat(indent);
let result = [];
// Add line number, node type, and parent information to the result
result.push(`${line}: ${indentation}${node.type}`);
if (node?.value) result.push(`: Value=${node.value}`);
const childCount = node.children.length;
if (childCount > 0) result.push(`: Child[${childCount}]`);
if (parentLine !== null) result.push(`, Parent=${parentLine}`);
result.push('\n');
let currentLine = line + 1;
// Process each child recursively
for (const child of node.children) {
const childResult = ASTtoStringInternal(child, indent + 1, currentLine, line);
result.push(childResult.result);
currentLine = childResult.line;
}
return { result: result.join(''), line: currentLine };
}
return ASTtoStringInternal(rootNode).result;
}
函数 ASTtoRegex()
// Convert AST to a regex string
// Notice we don't need the parent link
function ASTtoRegex(node) {
let regex = '';
switch (node.type) {
case 'ROOT':
case 'GROUP':
case 'NON_CAPTURING':
case 'CAPTURING':
case 'NAMED_CAPTURING':
case 'LOOKAHEAD':
case 'NEGATIVE_LOOKAHEAD':
case 'LOOKBEHIND':
case 'NEGATIVE_LOOKBEHIND':
// Handle different types of groupings with specific regex syntax
const groupPrefixes = {
'NON_CAPTURING': '(?:',
'NAMED_CAPTURING': `(?<${node.value}>`,
'CAPTURING': '(',
'LOOKAHEAD': '(?=',
'NEGATIVE_LOOKAHEAD': '(?!',
'LOOKBEHIND': '(?<=',
'NEGATIVE_LOOKBEHIND': '(?<!'
}
if (groupPrefixes[node.type])
regex += groupPrefixes[node.type]
// Process each child
node.children.forEach((child) =>
regex += ASTtoRegex(child)
)
// Close the group if necessary
if (['NON_CAPTURING', 'CAPTURING', 'NAMED_CAPTURING','LOOKAHEAD', 'NEGATIVE_LOOKAHEAD', 'LOOKBEHIND', 'NEGATIVE_LOOKBEHIND'].includes(node.type))
regex += ')'
break
case 'ALTERNATION':
// If alternation is inside a group, enclose it in parentheses
const needsGrouping = node.parent && node.parent.type == 'GROUP'
if (needsGrouping) regex += '('
node.children.forEach((child, childIndex) => {
if (childIndex > 0)
regex += '|'
regex += ASTtoRegex(child)
})
if (needsGrouping) regex += ')'
break
case 'QUANTIFIER':
// Assume the quantifier applies to the last segment of the regex
if (node.children.length > 0)
regex += ASTtoRegex(node.children[0])
regex += node.value
break
case 'CHAR_CLASS':
case 'PREDEFINED_CLASS':
case 'LITERAL':
case 'ANCHOR_START':
case 'ANCHOR_END':
case 'ESCAPED_CHAR':
regex += node.value
break
default:
throw new Error(`Unknown node type: ${node.type} (Node value: ${node.value})`)
}
return regex
}
函数 parseRegextoAST()
// Here is the Regex node for parsing and optimizing regex
// type is 'ANCHOR_START', 'ANCHOR_BEGIN', 'PREDEFINED_CLASS', 'ESCAPED_CHAR',
// 'CHAR_CLASS', 'CAPTURING', 'NON_CAPTURING', 'NAMED_CAPTURING',
// 'LOOKAHEAD', 'NEGATIVE_LOOKAHAED', 'LOOKBEHIND", 'NEGATIVE_LOOKBEHIND',
// 'ALTERNATION', 'GROUP', 'LITERAL', 'QUANTIFIER', 'ROOT'
// value is the value or undefined or null. fx. QUANTIFIERS can have the vlaue
// of '*",'+','?','*?','+?','??'
// for CAPTURING it is the grouping number
// for LITERAL it is the literal itself (character)
// for ANCHOR_START it is '^', ANCHOR_END it is '$'
// for PREDEFINED+CLASS it is one of theses: 'b', 'B', 'd', 'D', 'w', 'W', 's', 'S'
// for the other captures it is the string version e,g, '(?:' ...
//
class RegexNode {
constructor(type, value, parent=null) {
this.type = type; // Node type
this.value = value; // Node value
this.parent= parent; // Parent link
this.children = []; // Siblings
}
addChild(node) {
node.parent = this; // Set the parent link
this.children.push(node);
}
}
// Parse a Regex into an AST.
// The regex can be the regex itself or the string representation of it.
function parseRegextoAST(regex) {
const regexString = regex instanceof RegExp ? regex.source : regex
function parseRegex2AST(regexString, capturingGroupCount = { count: 0 }, parent = null) {
const rootNode = new RegexNode('ROOT', null, parent) // Set parent as null for root
let i = 0
function handleQuantifier(rootNode, regex, currentIndex, quantifier) {
let isLazy = (currentIndex + quantifier.length < regex.length && regex[currentIndex + quantifier.length] === '?')
let quantifierValue = quantifier + (isLazy ? '?' : '')
const lastNode = rootNode.children.pop()
const quantifierNode = new RegexNode('QUANTIFIER', quantifierValue, rootNode)
quantifierNode.addChild(lastNode)
rootNode.addChild(quantifierNode)
}
function handleGroup(type, regex, startIndex, name = '') {
let depth = 1
let i = startIndex
let groupNode = new RegexNode(type, name, rootNode)
while (i < regex.length && depth > 0) {
if (regex[i] === '(' && !(regex[i - 1] === '\\')) depth++
if (regex[i] === ')' && !(regex[i - 1] === '\\')) depth--
i++
}
let content = regex.substring(startIndex, i - 1)
let parsedContent = parseRegex2AST(content, capturingGroupCount, groupNode)
if (parsedContent.type === 'ROOT') parsedContent.type = 'GROUP'
if (parsedContent.type === 'GROUP') {
if (parsedContent.children.length === 1)
groupNode.addChild(parsedContent.children[0])
else if (parsedContent.children.length != 0)
groupNode.addChild(parsedContent)
} else
groupNode.addChild(parsedContent)
return { node: groupNode, newPosition: i }
}
while (i < regexString.length) {
const char = regexString[i]
switch (char) {
case '^':
rootNode.addChild(new RegexNode('ANCHOR_START', '^', rootNode))
break
case '$':
rootNode.addChild(new RegexNode('ANCHOR_END', '$', rootNode))
break
case '\\':
i++
if (['b', 'B', 'd', 'D', 'w', 'W', 's', 'S'].includes(regexString[i]))
rootNode.addChild(new RegexNode('PREDEFINED_CLASS', '\\' + regexString[i], rootNode))
else
rootNode.addChild(new RegexNode('ESCAPED_CHAR', regexString[i], rootNode))
break
case '[':
let endIdx = regexString.indexOf(']', i + 1)
rootNode.addChild(new RegexNode('CHAR_CLASS', regexString.substring(i, endIdx + 1), rootNode))
i = endIdx
break
case '(':
let result
if (regexString.substring(i, i + 3) === '(?:') {
i += 3
result = handleGroup('NON_CAPTURING', regexString, i)
} else if (regexString.substring(i, i + 3) === '(?<' && regexString.indexOf('>', i) !== -1) {
let nameEndIdx = regexString.indexOf('>', i)
let name = regexString.substring(i + 3, nameEndIdx)
i = nameEndIdx + 1
result = handleGroup('NAMED_CAPTURING', regexString, i, name)
} else if (regexString.substring(i, i + 3) === '(?=' || regexString.substring(i, i + 3) === '(?!') {
let type = regexString.substring(i + 2, i + 3) === '=' ? 'LOOKAHEAD' : 'NEGATIVE_LOOKAHEAD'
i += 3
result = handleGroup(type, regexString, i)
} else if (regexString.substring(i, i + 4) === '(?<=' || regexString.substring(i, i + 4) === '(?<!') {
let type = regexString.substring(i + 3, i + 4) === '=' ? 'LOOKBEHIND' : 'NEGATIVE_LOOKBEHIND'
i += 4
result = handleGroup(type, regexString, i)
} else {
i++
let currentCaptureCount = ++capturingGroupCount.count
result = handleGroup('CAPTURING', regexString, i)
result.node.value = currentCaptureCount
}
rootNode.addChild(result.node)
i = result.newPosition - 1
break
case '|':
let endIndex
if (rootNode.type !== 'ALTERNATION') {
const altNode = new RegexNode('ALTERNATION', null, rootNode)
const leftSideGroup = new RegexNode('GROUP', null, rootNode)
leftSideGroup.children = [...rootNode.children]
rootNode.children = []
altNode.addChild(leftSideGroup)
rootNode.addChild(altNode)
endIndex = regexString.length
const alternationContent = regexString.substring(i + 1)
const rightSideAST = parseRegex2AST(alternationContent, capturingGroupCount, altNode)
if (rightSideAST.children[0] && rightSideAST.children[0].type === 'ALTERNATION')
rightSideAST.children[0].children.forEach(child => altNode.addChild(child))
else {
const rightSideGroup = new RegexNode('GROUP', null, altNode)
rightSideGroup.children = [...rightSideAST.children]
altNode.addChild(rightSideGroup)
}
} else {
endIndex = regexString.length
const alternationContent = regexString.substring(i + 1)
const additionalAST = parseRegex2AST(alternationContent, capturingGroupCount, rootNode)
if (additionalAST.children[0] && additionalAST.children[0].type === 'ALTERNATION')
additionalAST.children[0].children.forEach(child => rootNode.addChild(child))
else {
const additionalGroup = new RegexNode('GROUP', null, rootNode)
additionalGroup.children = [...additionalAST.children]
rootNode.addChild(additionalGroup)
}
}
i = endIndex - 1
break
case '*':
case '+':
case '?':
handleQuantifier(rootNode, regexString, i, char)
if (regexString[i + 1] === '?')
i++
break
case '{':
let closeBraceIndex = regexString.indexOf('}', i)
if (closeBraceIndex !== -1) {
handleQuantifier(rootNode, regexString, i, regexString.substring(i, closeBraceIndex + 1))
if (regexString[closeBraceIndex + 1] === '?')
++closeBraceIndex
i = closeBraceIndex
}
break
default:
rootNode.addChild(new RegexNode('LITERAL', char, rootNode))
break
}
i++
}
return rootNode
}
function combineLiteralNodes(astNode) {
for (let i = 0; i < astNode.children.length; i++) {
let node = astNode.children[i]
if (node.type === 'LITERAL') {
while (i + 1 < astNode.children.length && astNode.children[i + 1].type === 'LITERAL') {
node.value += astNode.children[i + 1].value
astNode.children.splice(i + 1, 1)
}
}
if (node.children && node.children.length > 0)
combineLiteralNodes(node)
}
}
function removeUnnecessaryGroups(astNode) {
for (let i = 0; i < astNode.children.length; i++) {
let node = astNode.children[i]
if (node.type === 'GROUP' && node.value == null && node.children.length === 1) {
let child = node.children[0]
astNode.children[i] = child
child.parent = astNode
}
if (node.children && node.children.length > 0)
removeUnnecessaryGroups(node)
}
}
let ast = parseRegex2AST(regexString)
combineLiteralNodes(ast)
removeUnnecessaryGroups(ast)
return ast
}
函数 optimizeAST()// Optimize AST
function optimizeAST(node, changes = new Set()) {
let changeDesc = '';
// Find the next sibling or return null if not found or end of siblings
function nextSibling(currentNode) {
if (!currentNode.parent)
return null; // No parent means no siblings
const siblings = currentNode.parent.children; // Get all siblings (children of the parent)
const index = siblings.indexOf(currentNode); // Find the index of the current node
if (index >= 0 && index < siblings.length - 1)
return siblings[index + 1]; // Return the next sibling if it exists
return null; // No next sibling (end of siblings or node not found)
}
// Utility function to find the previous sibling of the current node
function previousSibling(node) {
if (!node.parent)
return null; // No parent means no siblings
const siblings = node.parent.children; // Get all siblings from the parent
const index = siblings.indexOf(node); // Find the index of the current node
if (index > 0)
return siblings[index - 1]; // Return the previous sibling if it exists
return null; // No previous sibling if the node is the first child
}
// Compare two nodes structurer
function CompareNodes(node1, node2) {
// Step 1: Check if both nodes are null (both are identical in this case)
if (!node1 && !node2) return true;
// Step 2: If one is null and the other is not, they are not similar
if (!node1 || !node2) return false;
// Step 3: Compare the current nodes (type and value)
if (node1.type !== node2.type || node1.value !== node2.value)
return false;
// Step 4: Compare the number of children
if (node1.children.length !== node2.children.length)
return false;
// Step 5: Recursively compare all children
for (let i = 0; i < node1.children.length; i++) {
if (!CompareNodes(node1.children[i], node2.children[i]))
return false; // If any child comparison fails, the entire structure is different
}
// Step 6: If all checks passed, the nodes and their children are similar
return true;
}
function normalizeCharacterClass(value) {
function preprocessCharacterClass(value) {
// [] has been strip before calling
// Replace shorthand character classes with their expanded forms or remove if redundant
const hasWordChar = /\\w/.test(content);
const hasNonWhitespace = /\\S/.test(content);
const hasDigit = /\\d/.test(content);
// Remove redundant escape sequences and deduplicate
content = content.replace(/(\\w)+/g, hasWordChar ? '\\w' : '')
.replace(/(\\S)+/g, hasNonWhitespace ? '\\S' : '')
.replace(/(\\d)+/g, hasDigit ? '\\d' : '');
if (hasNonWhitespace)
// If \S is present, remove \d, \w, and potentially more as they are redundant
content = content.replace(/\\d|\\w/g, '');
else if (hasWordChar)
// If \w is present, remove \d as it's redundant
content = content.replace(/\\d/g, '');
// Replace shorthand character classes with their expanded forms
content = content.replace(/\\d/g, '0-9');
content = content.replace(/\\w/g, 'a-zA-Z0-9_');
content = content.replace(/\\D/g, '^0-9');
content = content.replace(/\\W/g, '^a-zA-Z0-9_');
// Convert \D and \W within character classes to their negated \d and \w forms
content = content.replace(/\[\\D\]/g, '^0-9');
content = content.replace(/\[\\W\]/g, '^a-zA-Z0-9_');
content = content.replace(/\[\\s\]/g, ' \t\n\r\f\v');
content = content.replace(/\[\\S\]/g, '^ \t\n\r\f\v');
return content;
}
function expandCharacterRange(range) {
const result = [];
const start = range.charCodeAt(0);
const end = range.charCodeAt(2);
for (let i = start; i <= end; i++) {
result.push(String.fromCharCode(i));
}
return result;
}
function postProcessCharacterSet(characters) {
const digitRange = '0123456789';
const wordRange = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_';
const spaceRange = ' \t\n\r\v\f';
let charSet = new Set(characters);
// Helper function to check coverage of range
function isFullRangeCovered(range) {
return [...range].every(char => charSet.has(char));
}
// Helper function to create ranges from consecutive characters
function createCharacterRanges(chars) {
let ranges = [];
let lastChar = chars[0];
let rangeStart = lastChar;
for (let i = 1; i < chars.length; i++) {
let currentChar = chars[i];
if (currentChar.charCodeAt(0) !== lastChar.charCodeAt(0) + 1) {
if (rangeStart === lastChar)
ranges.push(rangeStart);
else
ranges.push(`${rangeStart}-${lastChar}`);
rangeStart = currentChar;
}
lastChar = currentChar;
}
if (rangeStart === lastChar)
ranges.push(rangeStart);
else
ranges.push(`${rangeStart}-${lastChar}`);
return ranges;
}
let result = [];
let processedChars = new Set();
// Check and process space range
if (isFullRangeCovered(spaceRange)) {
result.push('\\s');
[...spaceRange].forEach(char => processedChars.add(char));
}
// Check and process word range
if (isFullRangeCovered(wordRange)) {
result.push('\\w');
[...wordRange].forEach(char => processedChars.add(char));
}
else
{ // Check and process digit range
if (isFullRangeCovered(digitRange)) {
result.push('\\d');
[...digitRange].forEach(char => processedChars.add(char));
}
}
// Collect unprocessed characters
const remainingChars = [...charSet].filter(char => !processedChars.has(char));
if (remainingChars.length > 0) {
remainingChars.sort(); // Sort to ensure correct range detection
let ranges = createCharacterRanges(remainingChars);
result.push(ranges.join(''));
}
return result.join('');
}
// Remove the brackets for easier processing
let content = value.replace(/^\[|\]$/g, ''); // Remove the outer brackets
content = preprocessCharacterClass(content); // Preprocessing
const isNegated = value.startsWith('^');
let characters = new Set();
// Step 1: Capture valid ranges like a-z, 0-9
let ranges = content.match(/[^\\]-[^\\]/g) || [];
// Step 2: Replace the matched ranges with placeholders, then capture the remaining characters and escaped sequences
let withoutRanges = content.replace(/[^\\]-[^\\]/g, ''); // Remove ranges temporarily
let remainingParts = withoutRanges.match(/\\.|[^\\-]+|-/g) || [];
// Step 3: Combine the ranges and remaining parts back into the result
let parts = [...ranges, ...remainingParts];
parts.forEach(part => {
if (part.length === 3 && part[1] === '-' && !part.startsWith('\\')) {
// Correctly handle character ranges
expandCharacterRange(part).forEach(char => characters.add(char));
} else if (part === '\\w') {
// Handle \w by adding all alphanumeric characters and underscore
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_'.split('').forEach(char => characters.add(char));
} else {
// Add individual characters and other escaped characters
part.split('').forEach(char => characters.add(char));
}
});
// Convert set to sorted string and the postprocessing
let normalized = postProcessCharacterSet(Array.from(characters).sort().join(''));
return { normalized, isNegated };
}
// Traverse each child and optimize recursively
node.children.forEach(child => optimizeAST(child, changes));
switch (node.type) {
case 'CHAR_CLASS': // Optimize [] class
const { normalized, isNegated } = normalizeCharacterClass(node.value);
switch (normalized) {
case '\\d':
case '0123456789':
node.type = 'PREDEFINED_CLASS';
changeDesc = `Replaced ${node.value} with`;
node.value = isNegated ? '\\D' : '\\d';
changes.add(`${changeDesc} ${node.value} to simplify regex`);
break;
case '\\w':
case 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_':
node.type = 'PREDEFINED_CLASS';
changeDesc = `Replaced ${node.value} with`;
node.value = isNegated ? '\\W' : '\\w';
changes.add(`${changeDesc} ${node.value} to simplify regex`);
break;
case '\\s':
case ' \t\n\r\f\v':
node.type = 'PREDEFINED_CLASS';
changeDesc = `Replaced ${node.value} with`;
node.value = isNegated ? '\\S' : '\\s';
changes.add(`${changeDesc} ${node.value} to simplify regex`);
break;
default:
let content = node.value;
if (content.replace(/^\[|\]$/g, '') != normalized) {
changes.add(`${node.value} to [${normalized}] to simplify regex`);
node.value = '[' + normalized + ']';
}
}
break;
case 'QUANTIFIER': // Optimize quantifiers
let optimizationDone = false; // Flag to track if optimization was performed
// Helper function to optimize bracket quantifiers like {n}, {n,}, {n,m}
// {0,} -> *, {1,} -> +, {0,1} -> ?, {n.n} -> {n}, {n}? -> {n}
function optimizeBracketQuantifier(quantifier) {
const regex = /\{(\d+)(?:,(\d*))?\}(\?)?/;
const match = regex.exec(quantifier);
let optimizedQuantifier = quantifier;
let changeDesc = '';
if (match) {
const n = parseInt(match[1], 10);
const m = match[2] ? parseInt(match[2], 10) : null;
const isLazy = !!match[3];
if (m === null) {
if (match[0].includes(',')) {
if (n === 0) {
optimizedQuantifier = isLazy ? '*?' : '*';
changeDesc = `Optimized {0,} to ${optimizedQuantifier}`;
} else if (n === 1) {
optimizedQuantifier = isLazy ? '+?' : '+';
changeDesc = `Optimized {1,} to ${optimizedQuantifier}`;
}
} else {
optimizedQuantifier = `{${n}}`;
if(isLazy)
changeDesc = `Simplified {${n}}${isLazy ? '?' : ''} to {${n}}`;
}
} else {
if (n === m) {
optimizedQuantifier = `{${n}}`;
changeDesc = `Simplified {${n},${m}}${isLazy ? '?' : ''} to {${n}}`;
} else if (n === 0 && m === 1) {
optimizedQuantifier = isLazy ? '??' : '?';
changeDesc = `Optimized {0,1} to ${optimizedQuantifier}`;
} else {
optimizedQuantifier = `{${n},${m}}${isLazy ? '?' : ''}`;
changeDesc = `Retained quantifier {${n},${m}}${isLazy ? '?' : ''}`;
}
}
}
return { optimizedQuantifier, changeDesc };
}
if (node.children.length === 0) {// Empty quantifier
// Now, remove the empty quantifier node from the parent's children
node.parent.children = node.parent.children.filter(child => child !== node);
// Log the change for debugging or tracking purposes
changes.add(`Removed empty Quantifier`);
break;
}
let bracket = optimizeBracketQuantifier(node.value);
node.value = bracket.optimizedQuantifier;
if (bracket.changeDesc !== '') changes.add(bracket.changeDesc);
let isLazy = node.value.slice(-1) === '?' && node.value.slice(-2, -1) !== '?';
let baseValue = /*isLazy ? node.value.slice(0, -1) :*/ node.value;
let newQuantifier = '';
changeDesc = '';
// 1. Handle Simple Optimizations First
switch (baseValue) {
case '{1}':
case '{1}?':
// {1} or {1}? means exactly one repetition, so it's redundant
newQuantifier = ''; // The quantifier becomes redundant and can be removed
if (node.children.length > 0) {
// Get the first (and possibly only) child of the quantifier node
const childNode = node.children[0];
// Update the quantifier node to "become" the child node
node.type = childNode.type; // Change the quantifier node's type to the child's type (e.g., 'LITERAL')
node.value = childNode.value; // Set the quantifier node's value to the child's value
// Replace the quantifier node's children with the child's children
node.children = childNode.children;
// Ensure the new children (if any) have their parent updated to this node
if (node.children.length > 0)
node.children.forEach(child => child.parent = node);
// Log the optimization
changes.add(`Removed redundant quantifier ${baseValue} and simplified to its child node ${childNode.type}`);
} else {
// In case the quantifier node has no children, just clear its value and type
node.type = '';
node.value = '';
node.children = [];
changes.add(`Removed redundant quantifier ${baseValue} (no children to move up). THAT SHOULD NOT BE POSSIBLE`);
}
optimizationDone = true; // Mark that optimization has been done
break;
case '{0}':
case '{0}?':
// Handle the case where the quantifier is {0} or {0}?
newQuantifier = ''; // Simplifying the quantifier
// Check if the node has children and remove them (because {0} means no children)
let removeNode = node.children.pop(); // Pop removes the last child
if (removeNode) {
// Now, remove the quantifier node from the parent's children
node.parent.children = node.parent.children.filter(child => child !== node);
// Log the change for debugging or tracking purposes
changes.add(`Removed node with zero quantifier ${baseValue}`);
}
optimizationDone = true; // Mark that optimization has been done
break;
case '?':
case '*':
case '+':
break;
case '+?':
{
let childNode = node.children[0];
node.parent.children = node.parent.children.map(child => child === node ? childNode : child);
childNode.parent = node.parent;
changes.add(`Removed lazy +? quantifier and replaced with ${childNode.type}`);
optimizationDone = true; // Mark that optimization has been done
}
break;
case '*?':
{
// Updated decision table for lazy quantifier optimization
const optimizationTable = {
// No prevNode (null) cases
'null-null-literal': 'removeLazy',
'null-null-nonLiteral': 'keep',
'null-literal-literal': 'removeLazy',
'null-literal-nonLiteral': 'keep',
'null-nonLiteral-literal': 'removeLazy',
'null-nonLiteral-nonLiteral': 'keep',
'null-assertion-literal': 'removeLazy', // Lazy quantifier can be removed if nextNode is an assertion
'null-assertion-nonLiteral': 'removeLazy',
// PrevNode is a literal
'literal-null-literal': 'removeLazy',
'literal-null-nonLiteral': 'keep',
'literal-literal-literal': 'keep',
'literal-literal-nonLiteral': 'keep',
'literal-nonLiteral-literal': 'removeLazy',
'literal-nonLiteral-nonLiteral': 'keep',
'literal-assertion-literal': 'removeLazy', // Lazy quantifier can be removed if nextNode is an assertion
'literal-assertion-nonLiteral': 'removeLazy',
// PrevNode is a nonLiteral
'nonLiteral-null-literal': 'removeLazy',
'nonLiteral-null-nonLiteral': 'keep',
'nonLiteral-literal-literal': 'keep',
'nonLiteral-literal-nonLiteral': 'keep',
'nonLiteral-nonLiteral-literal': 'removeLazy',
'nonLiteral-nonLiteral-nonLiteral': 'keep',
'nonLiteral-assertion-literal': 'removeLazy', // Lazy quantifier can be removed if nextNode is an assertion
'nonLiteral-assertion-nonLiteral': 'removeLazy'
};
// Helper function to classify node types and create a key for the table
function classifyNode(node) {
if (node === null) return 'null';
if (node.type === 'LITERAL' ) return 'literal';
if (node.type === 'PREDEFINED' && node.value != '\b' && node.value != '\B' ) return 'literal';
if (node.type === 'CHAR_CLASS' ) return 'literal';
if (node.type === 'ANCHOR_START' || Node.type === 'ANCHOR_END') return 'assertion'; // Treat both as "assertion"
if (node.type === 'PREDEFINED'&& (node.value === '\\b' || node.value === '\\B' ) ) return 'assertion';
return 'nonLiteral';
}
const nextNode = nextSibling(node); // Get next sibling
const prevNode = previousSibling(node); // Get previous sibling
// Classify the node types
const prevNodeType = classifyNode(prevNode);
const nextNodeType = classifyNode(nextNode);
const quantifierNodeType = classifyNode(node.children[0]);
// Generate the key for the decision table
const decisionKey = `${prevNodeType}-${nextNodeType}-${quantifierNodeType}`;
// Look up the action in the optimization table
const action = optimizationTable[decisionKey];
const isAnchorOrLookaround = nextNode && ['ANCHOR_START', 'ANCHOR_END', 'LOOKAHEAD', 'NEGATIVE_LOOKAHEAD', 'LOOKBEHIND', 'NEGATIVE_LOOKBEHIND'].includes(nextNode.type);
const isWordBoundary = nextNode && nextNode.type === 'PREDEFINED_CLASS' && (nextNode.value === '\\b' || nextNode.value === '\\B');
// Execute the corresponding action
switch (action) {
case 'remove':
// Fully remove the lazy quantifier (replace *? with an empty string)
let childNode = node.children[0];
node.type = childNode.type;
node.value = childNode.value;
node.children = childNode.children;
node.children.forEach(child => child.parent = node);
changes.add(`Removed redundant lazy quantifier entirely followed by ${childNode.type}`);
optimizationDone = true; // Mark that optimization has been done
// changes.add(`Removed lazy quantifier *? entirely for scenario: ${tableKey}`);
break;
case 'removeLazy':
// Remove only the lazy quantifier (replace *? with *)
node.value = '*';
changes.add(`Removed lazy quantifier and kept greedy * for scenario: ${decisionKey}`);
optimizationDone = true; // Mark that optimization has been done
break;
case 'keep':
// Retain the lazy quantifier (*?) as it ensures minimal matching
changes.add(`Kept lazy quantifier *? for scenario: ${decisionKey}`);
optimizationDone = true; // Mark that optimization has been done
break;
default:
changes.add(`No action for scenario: ${decisionKey}`);
break;
}
}
break;
case '??':
let childNode = node.children[0];
node.parent.children = node.parent.children.map(child => child === node ? childNode : child);
childNode.parent = node.parent;
changes.add(`Removed lazy ?? quantifier and replaced with ${childNode.type}`);
optimizationDone = true; // Mark that optimization has been done
break;
}
// 3. Final Step: Only apply newQuantifier if no prior optimizations were done
if (!optimizationDone && (newQuantifier )) {
node.value = newQuantifier + (isLazy ? '?' : '');
changes.add(changeDesc + node.value);
}
break;
case 'NON_CAPTURING':
// Check if the non-capturing group is unnecessary
if (node.children.length === 0) {// Empty Non-capture
// Now, remove the empty non-capture node from the parent's children
node.parent.children = node.parent.children.filter(child => child !== node);
// Log the change for debugging or tracking purposes
changes.add(`Removed empty non-capture`);
}
if (node.children.length === 1) {
let child = node.children[0];
// Simple literals, predefined classes, or quantifiers
if (['LITERAL', 'PREDEFINED_CLASS', 'CHAR_CLASS','GROUP'].includes(child.type) || (child.type === 'QUANTIFIER' && !/[\{\}]/.test(child.value))) {
node.type = child.type;
node.value = child.value;
node.children = child.children;
node.children.forEach(c => c.parent = node);
changes.add(`Removed unnecessary non-capturing group `+(child.value?`around ${child.value}`:``));
} else if (node.children.every(child => child.type === 'ALTERNATION') &&
(nextSibling(node)==null||nextSibling(node).type==='ANCHOR_END')) {
// Handle simple alternation: (?:a|b) -> a|b
node.type = 'ALTERNATION';
node.value = ''; // No value needed for alternation
node.children = node.children[0].children; // Flatten the alternation group
// Update the parent reference for the alternation's children
node.children.forEach(c => c.parent = node);
// Log the change
changes.add(`Removed unnecessary non-capturing group around alternation.`);
}
}
// Handle non-capturing group at start or end of regex
if (node.parent && ['ANCHOR_START', 'ANCHOR_END'].includes(node.parent.type)) {
node.type = node.children[0].type;
node.value = node.children[0].value;
node.children = node.children[0].children;
node.children.forEach(c => c.parent = node);
changes.add(`Removed unnecessary non-capturing group at start(^)/end($) of regex.`);
}
break;
case 'ALTERNATION': {
// Helper functions for prefix and suffix calculation
function commonPrefix(a, b) {
let minLength = Math.min(a.length, b.length);
for (let i = 0; i < minLength; i++) {
if (a[i] !== b[i])
return a.substring(0, i);
}
return a.substring(0, minLength);
}
function commonSuffix(a, b) {
let minLength = Math.min(a.length, b.length);
for (let i = 0; i < minLength; i++) {
if (a[a.length - 1 - i] !== b[b.length - 1 - i])
return a.substring(a.length - i);
}
return a.substring(a.length - minLength);
}
// Helper function to identify the scenario for alternation
function identifyAlternationScenario(children) {
// Check if all children are either character classes, predefined classes, or single-character literals
const isCharClassOrPredefined = children.every(child => {
// Check for character classes or predefined classes directly
return ['CHAR_CLASS', 'PREDEFINED_CLASS'].includes(child.type) ||
(child.type === 'LITERAL' && child.value.length === 1); // Single-character literal
});
// Check if all children are single-character literals
const isLiteralGroups = children.every(child =>
child.type === 'LITERAL'
);
// Check if all children are single-character literals
const isSingleLiteralGroups = children.every(child =>
(child.type === 'LITERAL' && child.value.length === 1)
);
// Return the identified scenarios
return { isCharClassOrPredefined, isLiteralGroups, isSingleLiteralGroups };
}
// Detect alternation scenario
const { isCharClassOrPredefined, isLiteralGroups, isSingleLiteralGroups } = identifyAlternationScenario(node.children);
// Optimization for char classes or predefined classes in alternation
if (isCharClassOrPredefined) {
const mergedClasses = node.children.map(child => child.value.replace(/^\[|\]$/g, '')).join('');
const { normalized, isNegated } = normalizeCharacterClass(mergedClasses);
node.value = `[${isNegated ? '^' : ''}${normalized}]`;
node.children = []; // Clear the children since we replaced them with the merged class
node.type = 'CHAR_CLASS';
changes.add(`Merged character classes in alternation: ${mergedClasses}`);
break;
}
// Optimization for multi-character LITERAL alternation with common prefix/suffix
if (isLiteralGroups && !isSingleLiteralGroups) {
const values = node.children.map(literal => literal.value); // Direct access to literals
const hasEmptyAlternation = values[values.length - 1] === '';
const prefix = values.reduce(commonPrefix);
const suffix = values.reduce(commonSuffix, values[0]);
// Handle common prefix but no suffix (e.g., abc|aef -> a(bc|ef))
if (prefix.length > 0 && suffix.length === 0) {
const groupNode = new RegexNode('GROUP', '', node.parent);
// Add the common prefix as a LITERAL node
groupNode.addChild(new RegexNode('LITERAL', prefix, groupNode));
// Collect the remaining parts after the common prefix (e.g., bc, ef)
const remaining = values.map(value => value.substring(prefix.length));
// Check if the remaining parts can be optimized into a character class
if (remaining.every(part => part.length === 1)) {
// If all remaining parts are single characters, create a character class
const charClass = `[${remaining.join('')}]`;
groupNode.addChild(new RegexNode('CHAR_CLASS', charClass, groupNode));
changes.add(`Packed into character class after prefix: ${prefix}${charClass}`);
} else {
// Otherwise, create a new alternation node for the remaining parts
const alternationNode = new RegexNode('ALTERNATION', '', groupNode);
remaining.forEach(part => {
alternationNode.addChild(new RegexNode('LITERAL', part, alternationNode));
});
groupNode.addChild(alternationNode);
changes.add(`Factored out common prefix with alternation: ${prefix}`);
}
// Replace the alternation node with the optimized node
node.type = groupNode.type;
node.value = groupNode.value;
node.children = groupNode.children;
node.children.forEach(child => child.parent = node);
break;
}
// Handle common suffix (e.g., abc|dbc -> [ad]bc)
if (suffix.length > 1) {
const remainingPrefixes = values.map(value => value.substring(0, value.length - suffix.length));
if (remainingPrefixes.every(prefix => prefix.length === 1)) {
const charClass = `[${remainingPrefixes.join('')}]`;
const groupNode = new RegexNode('GROUP', '', node.parent);
groupNode.addChild(new RegexNode('CHAR_CLASS', charClass, groupNode));
for (let i = 0; i < suffix.length; i++) {
groupNode.addChild(new RegexNode('LITERAL', suffix[i], groupNode));
}
node.type = groupNode.type;
node.value = groupNode.value;
node.children = groupNode.children;
node.children.forEach(child => child.parent = node);
changes.add(`Factored out common suffix in alternation: ${suffix}`);
break;
}
}
}
// Optimization for single-character alternation into CHAR_CLASS
if (isSingleLiteralGroups) {
const literalValues = node.children.map(child => child.value); // Direct access to literals
const hasEmptyAlternation = literalValues[literalValues.length - 1] === '';
if (literalValues.length > 0) {
let charClass = `[${literalValues.join('')}]`;
const { normalized, isNegated } = normalizeCharacterClass(charClass);
charClass = `[${isNegated ? '^' : ''}${normalized}]`;
const charClassNode = new RegexNode('CHAR_CLASS', charClass, node.parent);
if (hasEmptyAlternation) {
const quantifierNode = new RegexNode('QUANTIFIER', '?', node.parent);
quantifierNode.addChild(charClassNode);
node.type = quantifierNode.type;
node.value = quantifierNode.value;
node.children = quantifierNode.children;
} else {
node.type = 'CHAR_CLASS';
node.value = charClass;
node.children = [];
}
changes.add(`Optimized alternation of single literals into character class: ${charClass}`);
break;
}
}
break;
}
case 'LOOKAHEAD':
case 'NEGATIVE_LOOKAHEAD':
// Handle both positive and negative lookaheads
// Check if the lookahead or negative lookahead is empty
if (node.children.length === 0) { // Empty lookahead
// Now, remove the empty lookahead node from the parent's children
node.parent.children = node.parent.children.filter(child => child !== node);
// Log the change for debugging or tracking purposes
changes.add(`Removed empty lookahead`);
break;
}
// Get the lookahead content directly from node children
let lookaheadContent = node.children.map(child => child.value || '').join('');
let parent = node.parent;
// 1. Redundant Lookahead Removal: Check if the lookahead is immediately followed by the same content
if (parent && parent.children && parent.children.length > 0) {
// Get the literals following the lookahead
let nextLiterals = parent.children.slice(parent.children.indexOf(node) + 1)
.filter(child => child.type === 'LITERAL')
.map(child => child.value)
.join('');
// If the lookahead content matches the following literals
if (lookaheadContent === nextLiterals) {
// Remove the lookahead node
parent.children = parent.children.filter(child => child !== node);
changes.add(`Removed redundant lookahead: (?=${lookaheadContent})`);
}
}
// 2. Combine Multiple Lookaheads: Combine consecutive lookaheads (both positive and negative)
if (parent && parent.children.filter(c => c.type.endsWith('LOOKAHEAD')).length > 1) {
let lookaheadSiblings = parent.children.filter(c => c.type.endsWith('LOOKAHEAD'));
let combinedContent = lookaheadSiblings.map(la => la.children.map(c => c.value).join('')).join('|');
// Create a new lookahead node with combined content
let combinedLookahead = new RegexNode(node.type, null, parent);
let combinedAlternationNode = new RegexNode('ALTERNATION', combinedContent, combinedLookahead);
combinedLookahead.addChild(combinedAlternationNode);
// Remove old lookaheads and replace them with the combined one
parent.children = parent.children.filter(c => !lookaheadSiblings.includes(c));
parent.addChild(combinedLookahead);
changes.add(`Combined multiple lookaheads into: (?=${combinedContent})`);
}
// 3. Handle Quantifiers within Lookaheads: Ensure quantifiers are properly accounted for
if (node.children.some(child => child.type === 'QUANTIFIER')) {
node.children.forEach(child => {
if (child.type === 'QUANTIFIER') {
let quantifierValue = child.value;
let quantifiedNode = child.children[0];
changes.add(`Detected quantifier (${quantifierValue}) in lookahead for: ${quantifiedNode.value}`);
}
});
}
break;
case 'LOOKBEHIND':
case 'NEGATIVE_LOOKBEHIND': {
// Handle both positive and negative lookbehinds
let parent = node.parent;
// 1. Empty Lookbehind Removal
if (node.children.length === 0) {
// Remove the empty lookbehind node
node.parent.children = node.parent.children.filter(child => child !== node);
changes.add(`Removed empty lookbehind`);
break;
}
// 2. Combine Multiple Lookbehinds: Combine consecutive lookbehinds (both positive and negative)
if (parent && parent.children.filter(c => c.type.endsWith('LOOKBEHIND')).length > 1) {
let lookbehinds = parent.children;
let consecutiveLookbehinds = [];
let previousIsLookbehind = false;
// Traverse through all the parent children to find consecutive lookbehinds
lookbehinds.forEach((child, index) => {
if (child.type.endsWith('LOOKBEHIND')) {
if (previousIsLookbehind) {
consecutiveLookbehinds.push(child); // Continue collecting consecutive lookbehinds
} else {
consecutiveLookbehinds = [child]; // Start a new group of consecutive lookbehinds
}
previousIsLookbehind = true;
} else {
if (consecutiveLookbehinds.length > 1) {
// We found a group of consecutive lookbehinds, so combine them
combineLookbehinds(consecutiveLookbehinds, parent);
changes.add(`Combined consecutive lookbehinds into: (?<=${getCombinedContent(consecutiveLookbehinds)})`);
}
previousIsLookbehind = false;
consecutiveLookbehinds = []; // Reset if we encounter a non-lookbehind node
}
});
// Check at the end of the traversal in case there are consecutive lookbehinds at the end
if (consecutiveLookbehinds.length > 1) {
combineLookbehinds(consecutiveLookbehinds, parent);
changes.add(`Combined consecutive lookbehinds into: (?<=${getCombinedContent(consecutiveLookbehinds)})`);
}
}
// Helper function to combine consecutive lookbehinds
function combineLookbehinds(lookbehinds, parent) {
// Create a new lookbehind node with ALTERNATION
let combinedLookbehind = new RegexNode(lookbehinds[0].type, null, parent);
let alternationNode = new RegexNode('ALTERNATION', '', combinedLookbehind);
// For each consecutive lookbehind, add a LITERAL node with the content to the ALTERNATION node
lookbehinds.forEach(lb => {
lb.children.forEach(child => {
let literalNode = new RegexNode('LITERAL', child.value, alternationNode);
alternationNode.addChild(literalNode);
});
});
// Attach the alternation node to the combined lookbehind
combinedLookbehind.addChild(alternationNode);
// Find the position of the first lookbehind in the consecutive group
let indexOfFirstLookbehind = parent.children.indexOf(lookbehinds[0]);
// Remove old lookbehinds from parent and insert the combined lookbehind at the correct position
parent.children = parent.children.filter(c => !lookbehinds.includes(c));
parent.children.splice(indexOfFirstLookbehind, 0, combinedLookbehind); // Insert at the original position
}
// Helper function to get combined content for logging purposes
function getCombinedContent(lookbehinds) {
return lookbehinds.map(lb => lb.children.map(c => c.value).join('|')).join('|');
}
// 3. Redundant Lookbehind Removal
let lookbehindContent = node.children.map(child => child.value || '').join('');
// Ensure we only check literals before the current node
if (parent && parent.children.length > 0) {
let lookbehindIndex = parent.children.indexOf(node); // Get the position of the current lookbehind
let prevLiterals = parent.children.slice(lookbehindIndex-1, lookbehindIndex) // Only consider nodes before the lookbehind
.filter(child => child.type === 'LITERAL') // Only consider literal nodes
.map(child => child.value)
.reverse() // Reverse since we're looking backward
.join('');
// If the lookbehind content matches the preceding literals
if (lookbehindContent === prevLiterals) {
// Remove the redundant lookbehind node
parent.children = parent.children.filter(child => child !== node);
changes.add(`Removed redundant lookbehind: (?<=${lookbehindContent})`);
}
}
// 4. Handle Quantifiers within Lookbehinds: Ensure quantifiers are properly accounted for
if (node.children.some(child => child.type === 'QUANTIFIER')) {
node.children.forEach(child => {
if (child.type === 'QUANTIFIER') {
let quantifierValue = child.value;
let quantifiedNode = child.children[0];
changes.add(`Detected quantifier (${quantifierValue}) in lookbehind for: ${quantifiedNode.value}`);
}
});
}
break;
}
}
return Array.from(changes).join('\n');
}