Brzozowski derivative
In theoretical computer science, in particular in formal language theory, the Brzozowski derivative u−1S of a set S of strings and a string u is defined as the set of all strings obtainable from a string in S by cutting off a prefixing u, formally: u−1S = { v ∈ Σ*: uv ∈ S }, cf. picture. It was introduced under various different names since the late 1950s.[1][2][3] Today it is named after the computer scientist Janusz Brzozowski who investigated their properties and gave an algorithm to compute the derivative of a generalized regular expression.[4]
Definition
Even though originally studied for regular expressions, the definition applies to arbitrary formal languages. Given any formal language S over an alphabet Σ and any string u ∈ Σ*, the derivative of S with respect to u is defined as:[5]
- u−1S = { v ∈ Σ*: uv ∈ S }
An equivalent way of stating this is, for all u,v ∈ Σ*:
- v ∈ u−1S iff uv ∈ S
which provides some intuition for the notation.
From the definition, for all u,v,w ∈ Σ*:
- v ∈ (uw)−1S iff uwv ∈ S iff wv ∈ u−1S iff v ∈ w−1(u−1S)
so (uw)−1S = w−1(u−1S).
The derivative with respect to an arbitrary string reduces to successive derivatives over the symbols of that string because, for a∈ Σ, u∈ Σ*:
(ua)−1S = a−1(u−1S) ε−1S = S
A language L⊆ Σ* is called nullable if it contains the empty string, that is, ε ∈ L. Each language S is uniquely determined by nullability of its derivatives:
- w ∈ S iff ε ∈ w−1S
A language can be viewed as a (potentially infinite) boolean-labelled tree (see also tree (set theory) and infinite-tree automaton). Each possible string w ∈ Σ* denotes a position in the tree, with binary label true when w ∈ S and false when w ∉ S. In this interpretation, the derivative with respect to a symbol a corresponds to computing the subtree obtained by following the edge a. Decomposing tree into the root and the subtrees a−1S corresponds to the following equality, which holds for every formal language S⊆ Σ*:
- S = ({ε}∩S) ∪ ⋃a∈Σ a(a−1S).
Derivatives of generalized regular expressions
When a language is given by a regular expression, the concept of derivatives leads to an algorithm for deciding whether a given word belongs to the regular expression.
Given a finite alphabet A of symbols,[6] a generalized regular expression denotes a possibly infinite set of finite-length strings of symbols from A. It may be built of:
- ∅ (denoting the empty set of strings),
- ε (denoting the singleton set containing just the empty string),
- a symbol a from A (denoting the singleton set containing the single-symbol string a),
- R∨S (where R and S are, in turn, generalized regular expressions; denoting their set's union),
- R∧S (denoting the intersection of R 's and S 's set),
- ¬R (denoting the complement of R 's set with respect to the set of all strings of symbols from A),
- RS (denoting the set of all possible concatenations of strings from R 's and S 's set),
- R* (denoting the set of n-fold repetitions of strings from R 's set, for any n≥0, including the empty string).
In an ordinary regular expression, neither ∧ nor ¬ is allowed. The string set denoted by a generalized regular expression R is called its language, denoted as L(R).
Computation
For any given generalized regular expression R and any string u, the derivative u−1R is again a generalized regular expression.[7] It may be computed recursively as follows.[8]
(ua)−1R | = a−1(u−1R) | for a symbol a and a string u |
ε−1R | = R |
Using the previous two rules, the derivative with respect to an arbitrary string is explained by the derivative with respect to a single-symbol string a. The latter can be computed as follows:[9]
a−1a | = ε | |
a−1b | = ∅ | for each symbol b≠a |
a−1ε | = ∅ | |
a−1∅ | = ∅ | |
a−1(R*) | = (a−1R)R* | |
a−1(RS) | = (a−1R)S ∨ ν(R)a−1S | |
a−1(R∧S) | = (a−1R) ∧ (a−1S) | |
a−1(R∨S) | = (a−1R) ∨ (a−1S) | |
a−1(¬R) | = ¬(a−1R) |
Here, ν(R) is an auxiliary function yielding a generalized regular expression that evaluates to the empty string ε if R 's language contains ε, and otherwise evaluates to ∅. This function can be computed by the following rules:[10]
ν(a) | = ∅ | for any symbol a |
ν(ε) | = ε | |
ν(∅) | = ∅ | |
ν(R*) | = ε | |
ν(RS) | = ν(R) ∧ ν(S) | |
ν(R ∧ S) | = ν(R) ∧ ν(S) | |
ν(R ∨ S) | = ν(R) ∨ ν(S) | |
ν(¬R) | = ε | if ν(R) = ∅ |
ν(¬R) | = ∅ | if ν(R) = ε |
Properties
A string u is a member of the string set denoted by a generalized regular expression R if and only if ε is a member of the string set denoted by the derivative u−1R.[11]
Considering all the derivatives of a fixed generalized regular expression R results in only finitely many different languages. If their number is denoted by dR, all these languages can be obtained as derivatives of R with respect to string of length below dR.[12] Furthermore, there is a complete deterministic finite automaton with dR states which recognises the regular language given by R, as laid out by the Myhill–Nerode theorem.
Derivatives of context-free languages
Derivatives are also effectively computable for recursively defined equations with regular expression operators, which are equivalent to context-free grammars. This insight was used to derive parsing algorithms for context-free languages.[13] Implementation of such algorithms have shown to have cubic complexity,[14] corresponding to the complexity of Earley parser on general context-free grammars.
See also
References
- George N. Raney (Apr 1958). "Sequential functions". Journal of the ACM. 5 (2): 177–180.
- Dana Scott and Michael Rabin (Apr 1959). "Finite Automata and Their Decision Problems" (PDF). IBM Journal of Research and Development. 3 (2): 114–125.
- C.C. Elgot and J.D. Rutledge (Oct 1961). "Operations on finite automata". In Robert S. Ledley (ed.). Proc. AIEE 2nd Ann. Symp. on Switching, Circuit Theory, and Logical Design (SWCT), Detroit. pp. 129–132. doi:10.1109/FOCS.1961.26.
- Janusz A. Brzozowski (1964). "Derivatives of Regular Expressions". J ACM. 11 (4): 481–494. doi:10.1145/321239.321249.
- Janusz A. Brzozowski (1964). "Derivatives of Regular Expressions". J ACM. 11 (4): 481–494. doi:10.1145/321239.321249.
- Brzozowski (1964), p.481, required A to consist of the 2n combinations of n bits, for some n.
- Brzozowski (1964), p.483, Theorem 4.1
- Brzozowski (1964), p.483, Theorem 3.2
- Brzozowski (1964), p.483, Theorem 3.1
- Brzozowski (1964), p.482, Definition 3.2
- Brzozowski (1964), p.483, Theorem 4.2
- Brzozowski (1964), p.484, Theorem 4.3
- Matthew Might; David Darais; Daniel Spiewak (2011). Parsing with derivatives: a functional pearl. Proceeding of the 16th ACM SIGPLAN international conference on Functional Programming (ICFP). pp. 189–195. doi:10.1145/2034773.2034801.
- Michael D. Adams; Celeste Hollenbeck; Matthew Might (2016). On the complexity and performance of parsing with derivatives. Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). pp. 224–236. doi:10.1145/2908080.2908128.