<div align="center">
  <img alt="reghex" width="250" src="docs/reghex-logo.png" />
  <br />
  <br />
  <strong>
    The magical sticky regex-based parser generator
  </strong>
  <br />
  <br />
  <br />
</div>

Leveraging the power of sticky regexes and JS code generation, `reghex` allows
you to code parsers quickly, by surrounding regular expressions with a regex-like
[DSL](https://en.wikipedia.org/wiki/Domain-specific_language).

With `reghex` you can generate a parser from a tagged template literal, which is
quick to prototype and generates reasonably compact and performant code.

_This project is still in its early stages and is experimental. Its API may still
change and some issues may need to be ironed out._

## Quick Start

##### 1. Install with yarn or npm

```sh
yarn add reghex
# or
npm install --save reghex
```

##### 2. Add the plugin to your Babel configuration _(optional)_

In your `.babelrc`, `babel.config.js`, or `package.json:babel` add:

```json
{
  "plugins": ["reghex/babel"]
}
```

Alternatively, you can set up [`babel-plugin-macros`](https://github.com/kentcdodds/babel-plugin-macros) and
import `reghex` from `"reghex/macro"` instead.

This step is **optional**. `reghex` can also generate its optimised JS code at runtime.
This only incurs a tiny parsing cost on initialisation, and thanks to the JIT compilers of modern
JS engines there won't otherwise be any difference in performance between pre-compiled and
runtime-compiled versions.

Since the `reghex` runtime is rather small, for larger grammars it may even make sense not
to precompile the matchers at all. For this case you may pass the `{ "codegen": false }`
option to the Babel plugin, which will minify the `reghex` matcher templates without
precompiling them.

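As a sketch, and assuming Babel's usual `[name, options]` tuple syntax for plugin options, the `codegen` option described above could be passed like this in a `.babelrc`:

```json
{
  "plugins": [["reghex/babel", { "codegen": false }]]
}
```
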
##### 3. Have fun writing parsers!

```js
import { match, parse } from 'reghex';

const name = match('name')`
  ${/\w+/}
`;

parse(name)('hello');
// [ "hello", .tag = "name" ]
```

## Concepts

The fundamental building blocks of `reghex` are regexes, specifically
[sticky regexes](https://www.loganfranken.com/blog/831/es6-everyday-sticky-regex-matches/)!
These are regular expressions that don't search a target string, but instead match only at the
specific position they're at. The flag for sticky regexes is `y`, so
they can be created using `/phrase/y` or `new RegExp('phrase', 'y')`.

**Sticky Regexes** are the perfect foundation for a parsing framework in JavaScript!
Because they only match at a single position, they can be used to match patterns
continuously, as a parser would. As with global regexes, we can control where
they are matched by setting `regex.lastIndex = index;` and, after matching,
read back their updated `regex.lastIndex`.

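To make the difference concrete, here is a small sketch in plain JavaScript (no `reghex` involved) that compares a sticky regex with a global one on the same input. The variable names are purely illustrative.

```js
const input = 'say phrase';

// A sticky regex only tries to match at exactly `lastIndex`
const sticky = /phrase/y;
sticky.lastIndex = 0;
const stickyAtZero = sticky.exec(input); // null, "phrase" doesn't start at index 0

sticky.lastIndex = 4;
const stickyAtFour = sticky.exec(input); // matches "phrase" at index 4
const indexAfterMatch = sticky.lastIndex; // 10, the index right after the match

// A global regex searches forward from `lastIndex` instead
const searching = /phrase/g;
searching.lastIndex = 0;
const found = searching.exec(input); // finds "phrase" at index 4
```

The sticky regex fails at index 0 rather than skipping ahead, which is the behaviour a parser needs when it consumes input position by position.
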
> **Note:** Sticky Regexes aren't natively
> [supported in any version of Internet Explorer](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/sticky#Browser_compatibility). `reghex` works around this by imitating their behaviour, which may decrease performance on IE11.

This primitive allows us to build up a parser from the regexes that you pass when
authoring a parser function, also called a "matcher" in `reghex`. When `reghex` compiles
a matcher to parser code, this code is just a sequence and combination of sticky regexes that
are executed in order!

```js
let input = 'phrases should be parsed...';
let lastIndex = 0;

const regex = /phrase/y;
function matcher() {
  let match;
  // Before matching we set the current index on the RegExp
  regex.lastIndex = lastIndex;
  // Then we match and store the result
  if ((match = regex.exec(input))) {
    // If the RegExp matches successfully, we update our lastIndex
    lastIndex = regex.lastIndex;
  }
}
```

This mechanism is used in all matcher functions that `reghex` generates.
Internally `reghex` keeps track of the input string and the current index on
that string, and the matcher functions execute regexes against this state.

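The idea of matchers sharing one piece of state can be sketched as follows. This is not `reghex`'s internal code: the `pattern` helper and the `{ input, index }` state shape are assumptions made up for this illustration.

```js
// Hypothetical helper, not reghex's internal API: wraps a regex in a
// function that matches against a shared { input, index } state object.
function pattern(regex) {
  // Copy the regex with the sticky flag so it only matches at lastIndex
  const sticky = new RegExp(regex.source, 'y');
  return function matcher(state) {
    sticky.lastIndex = state.index;
    const match = sticky.exec(state.input);
    if (match === null) return undefined; // state.index stays untouched
    state.index = sticky.lastIndex; // advance past the matched text
    return match[0];
  };
}

const state = { input: 'phrases should be parsed...', index: 0 };
const phrase = pattern(/phrase/);
const plural = pattern(/s /);

const first = phrase(state); // "phrase", state.index is now 6
const second = plural(state); // "s ", state.index is now 8
const third = phrase(state); // undefined, "should" starts at index 8
```

Because every matcher reads and writes the same index, running them one after another walks through the input without any searching.
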
## Authoring Guide

You can write "matchers" by importing `match` from `reghex` and
using it to write a matcher expression.

```js
import { match } from 'reghex';

const name = match('name')`
  ${/\w+/}
`;
```

As can be seen above, the `match` function is called with a "node name" and
its result is then used as a template tag. This template is our **parsing definition**.

When the `reghex` Babel plugin is enabled, it will detect `match('name')`
and replace the entire tag with a parsing function, which may then look like
the following in your transpiled code:

```js
import { _pattern /* ... */ } from 'reghex';

var _name_expression = _pattern(/\w+/);
var name = function name() {
  /* ... */
};
```

We've now successfully created a matcher, which matches a single regex, which
is a pattern of one or more word characters (letters, digits, or underscores).
We can execute this matcher by calling it with the curried `parse` utility:

```js
import { parse } from 'reghex';

const result = parse(name)('Tim');

console.log(result); // [ "Tim", .tag = "name" ]
console.log(result.tag); // "name"
```

If the string (here: "Tim") was parsed successfully by the matcher, it will
return an array that contains the result of the regex. The array is special
in that it will also have a `tag` property set to the matcher's name, here
`"name"`, which we determined when we defined the matcher as `match('name')`.

```js
import { parse } from 'reghex';
parse(name)('?!'); // undefined
```

Similarly, if the matcher does not parse an input string successfully, it will
return `undefined` instead.

### Nested matchers

This on its own is nice, but a parser must be able to traverse a string and
turn it into an [Abstract Syntax Tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree).
To introduce nesting to `reghex` matchers, we can refer to one matcher in another!
Let's extend our original example:

```js
import { match } from 'reghex';

const name = match('name')`
  ${/\w+/}
`;

const hello = match('hello')`
  ${/hello /} ${name}
`;
```

The new `hello` matcher is set to match `/hello /` and then attempts to match
the `name` matcher afterwards. If either of these matchers fails, `hello` will return
`undefined` as well and roll back its changes. Using this matcher will give us
**nested abstract output**.

We can also see in this example that _outside_ of the regex interpolations,
whitespace and newlines don't matter.

```js
import { parse } from 'reghex';

parse(hello)('hello tim');
/*
  [
    "hello ",
    ["tim", .tag = "name"],
    .tag = "hello"
  ]
*/
```

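The sequencing and roll-back behaviour described above can be sketched in plain JavaScript. The `pattern` and `sequence` helpers and the `{ input, index }` state are hypothetical names for this illustration; the code `reghex` actually generates will differ.

```js
// Hypothetical helpers for illustration; not reghex's generated code.
function pattern(regex) {
  const sticky = new RegExp(regex.source, 'y');
  return (state) => {
    sticky.lastIndex = state.index;
    const match = sticky.exec(state.input);
    if (match === null) return undefined;
    state.index = sticky.lastIndex;
    return match[0];
  };
}

// Runs matchers in order and collects their results into a tagged array.
function sequence(tagName, ...matchers) {
  return (state) => {
    const startIndex = state.index;
    const node = [];
    for (const matcher of matchers) {
      const result = matcher(state);
      if (result === undefined) {
        state.index = startIndex; // roll back everything matched so far
        return undefined;
      }
      node.push(result);
    }
    node.tag = tagName;
    return node;
  };
}

const name = sequence('name', pattern(/\w+/));
const hello = sequence('hello', pattern(/hello /), name);

const okState = { input: 'hello tim', index: 0 };
const ok = hello(okState); // nested node, okState.index is now 9

const failState = { input: 'hello !', index: 0 };
const failed = hello(failState); // undefined, failState.index is back at 0
```

In the failing case `/hello /` has already advanced the index to 6 before `name` fails, so restoring the start index is what lets a surrounding matcher try something else from the original position.
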
Furthermore, interpolations don't have to just be RegHex matchers. They can
also be functions returning matchers or completely custom matching functions.
This is useful when your DSL becomes _self-referential_, i.e. when matchers
start referencing each other, forming a loop. To fix this we can create a
function that returns our root matcher:

```js
import { match } from 'reghex';

const value = match('value')`
  (${/\w+/} | ${() => root})+
`;

const root = match('root')`
  ${/root/}+ ${value}
`;
```

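The reason a function helps is that a `const` cannot be read before its declaration has run, while a function body is only evaluated when it is called. The following sketch shows the same trick with hand-written combinators; `literal`, `sequence`, `either`, and `lazy` are hypothetical helpers, not `reghex` exports.

```js
// Hypothetical combinators for illustration only
const literal = (text) => (state) => {
  if (!state.input.startsWith(text, state.index)) return undefined;
  state.index += text.length;
  return text;
};

const sequence = (...matchers) => (state) => {
  const startIndex = state.index;
  const results = [];
  for (const matcher of matchers) {
    const result = matcher(state);
    if (result === undefined) {
      state.index = startIndex; // roll back on failure
      return undefined;
    }
    results.push(result);
  }
  return results;
};

const either = (first, second) => (state) => {
  const result = first(state);
  return result !== undefined ? result : second(state);
};

// Defers looking up the matcher until matching actually starts
const lazy = (getMatcher) => (state) => getMatcher()(state);

// `nested` isn't initialised yet on the next line, so passing it directly
// would throw a ReferenceError; wrapping it in a function avoids that.
const inParens = sequence(literal('('), lazy(() => nested), literal(')'));
const nested = either(inParens, literal('x'));

const result = nested({ input: '((x))', index: 0 });
// [ "(", [ "(", "x", ")" ], ")" ]
```
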
### Regex-like DSL

We've seen in the previous examples that matchers are authored using tagged
template literals, where interpolations can either be filled using regexes,
`${/pattern/}`, or with other matchers `${name}`.

The tagged template syntax supports more ways to match these interpolations,
using a regex-like Domain Specific Language. Unlike in regexes, whitespace
and newlines don't matter, which makes it easier to format and read matchers.

We can create **sequences** of matchers by adding multiple expressions in
a row. A matcher using `${/1/} ${/2/}` will attempt to match `1` and then `2`
in the parsed string. This is just one feature of the regex-like DSL. The
available operators are the following:

| Operator | Example            | Description                                                                                                                                                                              |
| -------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `?`      | `${/1/}?`          | An **optional** may be used to make an interpolation optional. This means that the interpolation may or may not match.                                                                   |
| `*`      | `${/1/}*`          | A **star** can be used to match an interpolation any number of times or not at all. This means that the interpolation may repeat itself or may not be matched at all.                    |
| `+`      | `${/1/}+`          | A **plus** is used like `*` but must match one or more times. When the interpolation doesn't match at all, that's considered a failing case, since the match isn't optional.             |
| `\|`     | `${/1/} \| ${/2/}` | An **alternation** can be used to match either one thing or another, falling back when the first interpolation fails.                                                                    |
| `()`     | `(${/1/} ${/2/})+` | A **group** can be used to apply one of the other operators to an entire group of interpolations.                                                                                        |
| `(?: )`  | `(?: ${/1/})`      | A **non-capturing group** is like a regular group, but the interpolations matched inside it don't appear in the parser's output.                                                         |
| `(?= )`  | `(?= ${/1/})`      | A **positive lookahead** checks whether interpolations match, and if so continues the matcher without changing the input. If it matches, it's essentially ignored.                       |
| `(?! )`  | `(?! ${/1/})`      | A **negative lookahead** checks whether interpolations _don't_ match, and if so continues the matcher without changing the input. If the interpolations do match the matcher is aborted. |

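The two lookahead rows are the least intuitive, so here is a rough sketch of the "check without consuming" idea in plain JavaScript. The `pattern` and `lookahead` helpers are hypothetical and only mirror the table's description; they are not how `reghex` implements lookaheads.

```js
// Hypothetical helpers for illustration only
const pattern = (regex) => {
  const sticky = new RegExp(regex.source, 'y');
  return (state) => {
    sticky.lastIndex = state.index;
    const match = sticky.exec(state.input);
    if (match === null) return undefined;
    state.index = sticky.lastIndex;
    return match[0];
  };
};

// Runs `matcher`, then always restores the index so no input is consumed.
// Succeeds (returns true) when the outcome equals `shouldMatch`.
const lookahead = (matcher, shouldMatch) => (state) => {
  const startIndex = state.index;
  const didMatch = matcher(state) !== undefined;
  state.index = startIndex;
  return didMatch === shouldMatch ? true : undefined;
};

const state = { input: 'tim!', index: 0 };
const positive = lookahead(pattern(/tim/), true)(state); // true
const negative = lookahead(pattern(/tim/), false)(state); // undefined, aborts
const notTom = lookahead(pattern(/tom/), false)(state); // true
// state.index is still 0 after all three checks
```
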
We can combine and compose these operators to create more complex matchers.
For instance, we can extend the original example to only allow a specific set
of names by using the `|` operator:

```js
const name = match('name')`
  ${/tim/} | ${/tom/} | ${/tam/}
`;

parse(name)('tim'); // [ "tim", .tag = "name" ]
parse(name)('tom'); // [ "tom", .tag = "name" ]
parse(name)('patrick'); // undefined
```

The above will now only match specific name strings. When one pattern in this
chain of **alternations** does not match, it will try the next one.

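As a sketch of this fall-through behaviour, and assuming alternatives are tried strictly from left to right as described above, an alternation could be modelled like this. `pattern` and `alternation` are hypothetical helpers, not part of `reghex`'s API.

```js
// Hypothetical helpers for illustration only
const pattern = (regex) => {
  const sticky = new RegExp(regex.source, 'y');
  return (state) => {
    sticky.lastIndex = state.index;
    const match = sticky.exec(state.input);
    if (match === null) return undefined;
    state.index = sticky.lastIndex;
    return match[0];
  };
};

// Tries each matcher in order and returns the first successful result
const alternation = (...matchers) => (state) => {
  for (const matcher of matchers) {
    const result = matcher(state);
    if (result !== undefined) return result;
  }
  return undefined;
};

const name = alternation(pattern(/tim/), pattern(/tom/), pattern(/tam/));

const tom = name({ input: 'tom', index: 0 }); // "tom", after /tim/ failed
const patrick = name({ input: 'patrick', index: 0 }); // undefined, nothing matched

// Under this model order matters: the first alternative that matches wins
const ordered = alternation(pattern(/tim/), pattern(/timothy/));
const shortest = ordered({ input: 'timothy', index: 0 }); // "tim"
```

If the longer alternative should be preferred, it would have to be listed first under this left-to-right assumption.
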
We can also use **groups** to add more matchers around the alternations themselves,
by surrounding the alternations with `(` and `)`:

```js
const name = match('name')`
  (${/tim/} | ${/tom/}) ${/!/}
`;

parse(name)('tim!'); // [ "tim", "!", .tag = "name" ]
parse(name)('tom!'); // [ "tom", "!", .tag = "name" ]
parse(name)('tim'); // undefined
```

Maybe we're also not that interested in the `"!"` showing up in the output node.
If we want to get rid of it, we can use a **non-capturing group** to hide it,
while still requiring it.

```js
const name = match('name')`
  (${/tim/} | ${/tom/}) (?: ${/!/})
`;

parse(name)('tim!'); // [ "tim", .tag = "name" ]
parse(name)('tim'); // undefined
```

Lastly, like with regexes, `?`, `*`, and `+` may be used as "quantifiers". The first two
are optional: their patterns may _not_ match without the matcher failing.
The `+` operator is used to match an interpolation _one or more_ times, while the
`*` operator may match _zero or more_ times. Let's use this to allow the `"!"`
to repeat.

```js
const name = match('name')`
  (${/tim/} | ${/tom/})+ (?: ${/!/})*
`;

parse(name)('tim!'); // [ "tim", .tag = "name" ]
parse(name)('tim!!!!'); // [ "tim", .tag = "name" ]
parse(name)('tim'); // [ "tim", .tag = "name" ]
parse(name)('timtim'); // [ "tim", "tim", .tag = "name" ]
```

As we can see from the above, like in regexes, quantifiers can be combined with
interpolations, groups, or non-capturing groups.

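The three quantifiers differ only in how many matches they require, which the following sketch tries to show. `pattern`, `repeat`, and `optional` are hypothetical helpers written for this example and are not `reghex` exports.

```js
// Hypothetical helpers for illustration only
const pattern = (regex) => {
  const sticky = new RegExp(regex.source, 'y');
  return (state) => {
    sticky.lastIndex = state.index;
    const match = sticky.exec(state.input);
    if (match === null) return undefined;
    state.index = sticky.lastIndex;
    return match[0];
  };
};

// Repeats `matcher` until it fails; fails itself below `min` matches
const repeat = (matcher, min) => (state) => {
  const startIndex = state.index;
  const results = [];
  while (true) {
    const before = state.index;
    const result = matcher(state);
    if (result === undefined) break;
    results.push(result);
    if (state.index === before) break; // avoid looping forever on empty matches
  }
  if (results.length < min) {
    state.index = startIndex;
    return undefined;
  }
  return results;
};

// Zero or one match; never fails
const optional = (matcher) => (state) => {
  const result = matcher(state);
  return result === undefined ? [] : [result];
};

const bang = pattern(/!/);
const star = repeat(bang, 0); // like `*`
const plus = repeat(bang, 1); // like `+`

const many = star({ input: '!!!', index: 0 }); // ["!", "!", "!"]
const none = star({ input: 'tim', index: 0 }); // [], zero matches is fine for `*`
const failed = plus({ input: 'tim', index: 0 }); // undefined, `+` needs at least one
const maybe = optional(bang)({ input: 'tim', index: 0 }); // []
```
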
### Transforming as we match

In the previous sections, we've seen that the **nodes** that `reghex` outputs are arrays containing
match strings or other nodes and have a special `tag` property with the node's type.
We can **change this output** while we're parsing by passing a function to our matcher definition.

```js
const name = match('name', (x) => x[0])`
  (${/tim/} | ${/tom/}) ${/!/}
`;

parse(name)('tim!'); // "tim"
```

In the above example, we're passing a small function, `x => x[0]`, to `match` as a
second argument. This changes the matcher's output, so the parser now returns
the function's result for this matcher instead of the node.

We can use this function creatively by outputting full AST nodes, perhaps even
ones that resemble Babel's output:

```js
const identifier = match('identifier', (x) => ({
  type: 'Identifier',
  name: x[0],
}))`
  ${/[\w_][\w\d_]+/}
`;

parse(identifier)('var_name'); // { type: "Identifier", name: "var_name" }
```

We've now entirely changed the output of the parser for this matcher. Given that each
matcher can change its output, we're free to change the parser's output entirely.
By **returning a falsy value** from this function, we can also make the matcher count as not
having matched, which causes other matchers to treat it like a mismatch!

```js
import { match, parse } from 'reghex';

const name = match('name', (x) => {
  return x[0] !== 'tim' ? x : undefined;
})`
  ${/\w+/}
`;

const hello = match('hello')`
  ${/hello /} ${name}
`;

parse(hello)('hello tom'); // ["hello ", ["tom", .tag = "name"], .tag = "hello"]
parse(hello)('hello tim'); // undefined
```

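How a transform can turn a successful match into a mismatch is sketched below in plain JavaScript. `pattern` and `withTransform` are hypothetical helpers, and restoring the index on a falsy result is an assumption of this sketch rather than documented `reghex` behaviour.

```js
// Hypothetical helpers for illustration only
const pattern = (regex) => {
  const sticky = new RegExp(regex.source, 'y');
  return (state) => {
    sticky.lastIndex = state.index;
    const match = sticky.exec(state.input);
    if (match === null) return undefined;
    state.index = sticky.lastIndex;
    return match[0];
  };
};

// Applies `transform` to a successful result; a falsy return counts as a mismatch
const withTransform = (matcher, transform) => (state) => {
  const startIndex = state.index;
  const result = matcher(state);
  if (result === undefined) return undefined;
  const output = transform(result);
  if (!output) {
    state.index = startIndex; // assumed: undo the match, like any other failure
    return undefined;
  }
  return output;
};

const name = withTransform(pattern(/\w+/), (text) =>
  text !== 'tim' ? { type: 'Name', value: text } : undefined
);

const tom = name({ input: 'tom', index: 0 }); // { type: "Name", value: "tom" }
const timState = { input: 'tim', index: 0 };
const tim = name(timState); // undefined, and timState.index is back at 0
```
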
Lastly, if we need to create these special array nodes ourselves, we can use `reghex`'s
`tag` export for this purpose.

```js
import { tag } from 'reghex';

tag(['test'], 'node_name');
// ["test", .tag = "node_name"]
```

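To illustrate what such a node is in plain JavaScript terms, here is a stand-in called `tagNode`. It is a hypothetical re-implementation for this example only; real code should use `reghex`'s own `tag` export, whose implementation may differ.

```js
// Hypothetical stand-in for illustration; not reghex's implementation
function tagNode(array, tagName) {
  array.tag = tagName; // an extra property, not an array element
  return array;
}

const node = tagNode(['test'], 'node_name');

const isArray = Array.isArray(node); // true, the node is still a regular array
const length = node.length; // 1, the tag doesn't count as an element
const serialized = JSON.stringify(node); // '["test"]', JSON skips extra array properties
```

Since `JSON.stringify` ignores non-index properties on arrays, a node built this way would lose its `tag` when serialised, so it has to be copied into the output explicitly if it is needed there.
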
**That's it! May the RegExp be ever in your favor.**