README.md at main · kitten.sh/reghex

kitten.sh / reghex
Mirror: The magical sticky regex-based parser generator 🧙
reghex / README.md
at main 16 kB view raw view rendered
  1<div align="center">
  2  <img alt="reghex" width="250" src="docs/reghex-logo.png" />
  3  <br />
  4  <br />
  5  <strong>
  6    The magical sticky regex-based parser generator
  7  </strong>
  8  <br />
  9  <br />
 10  <br />
 11</div>
 12
 13Leveraging the power of sticky regexes and JS code generation, `reghex` allows
 14you to code parsers quickly, by surrounding regular expressions with a regex-like
 15[DSL](https://en.wikipedia.org/wiki/Domain-specific_language).
 16
 17With `reghex` you can generate a parser from a tagged template literal, which is
 18quick to prototype and generates reasonably compact and performant code.
 19
 20_This project is still in its early stages and is experimental. Its API may still
 21change and some issues may need to be ironed out._
 22
 23## Quick Start
 24
 25##### 1. Install with yarn or npm
 26
 27```sh
 28yarn add reghex
 29# or
 30npm install --save reghex
 31```
 32
 33##### 2. Add the plugin to your Babel configuration _(optional)_
 34
 35In your `.babelrc`, `babel.config.js`, or `package.json:babel` add:
 36
 37```json
 38{
 39  "plugins": ["reghex/babel"]
 40}
 41```
 42
 43Alternatively, you can set up [`babel-plugin-macros`](https://github.com/kentcdodds/babel-plugin-macros) and
 44import `reghex` from `"reghex/macro"` instead.
 45
 46This step is **optional**. `reghex` can also generate its optimised JS code during runtime.
 47This will only incur a tiny parsing cost on initialisation, but due to the JIT of modern
 48JS engines there won't be any difference in performance between pre-compiled and compiled
 49versions otherwise.
 50
 51Since the `reghex` runtime is rather small, for larger grammars it may even make sense not
 52to precompile the matchers at all. For this case you may pass the `{ "codegen": false }`
 53option to the Babel plugin, which will minify the `reghex` matcher templates without
 54precompiling them.
 55
 56##### 3. Have fun writing parsers!
 57
 58```js
 59import { match, parse } from 'reghex';
 60
 61const name = match('name')`
 62  ${/\w+/}
 63`;
 64
 65parse(name)('hello');
 66// [ "hello", .tag = "name" ]
 67```
 68
 69## Concepts
 70
 71The fundamental concept of `reghex` are regexes, specifically
 72[sticky regexes](https://www.loganfranken.com/blog/831/es6-everyday-sticky-regex-matches/)!
 73These are regular expressions that don't search a target string, but instead match at the
 74specific position they're at. The flag for sticky regexes is `y` and hence
 75they can be created using `/phrase/y` or `new RegExp('phrase', 'y')`.
 76
 77**Sticky Regexes** are the perfect foundation for a parsing framework in JavaScript!
 78Because they only match at a single position they can be used to match patterns
 79continuously, as a parser would. Like global regexes, we can then manipulate where
 80they should be matched by setting `regex.lastIndex = index;` and after matching
 81read back their updated `regex.lastIndex`.
 82
 83> **Note:** Sticky Regexes aren't natively
 84> [supported in any versions of Internet Explorer](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/sticky#Browser_compatibility). `reghex` works around this by imitating its behaviour, which may decrease performance on IE11.
 85
 86This primitive allows us to build up a parser from regexes that you pass when
 87authoring a parser function, also called a "matcher" in `reghex`. When `reghex` compiles
 88to parser code, this code is just a sequence and combination of sticky regexes that
 89are executed in order!
 90
 91```js
 92let input = 'phrases should be parsed...';
 93let lastIndex = 0;
 94
 95const regex = /phrase/y;
 96function matcher() {
 97  let match;
 98  // Before matching we set the current index on the RegExp
 99  regex.lastIndex = lastIndex;
100  // Then we match and store the result
101  if ((match = regex.exec(input))) {
102    // If the RegExp matches successfully, we update our lastIndex
103    lastIndex = regex.lastIndex;
104  }
105}
106```
107
108This mechanism is used in all matcher functions that `reghex` generates.
109Internally `reghex` keeps track of the input string and the current index on
110that string, and the matcher functions execute regexes against this state.
111
112## Authoring Guide
113
114You can write "matchers" by importing the `match` import from `reghex` and
115using it to write a matcher expression.
116
117```js
118import { match } from 'reghex';
119
120const name = match('name')`
121  ${/\w+/}
122`;
123```
124
125As can be seen above, the `match` function, is called with a "node name" and
126is then called as a tagged template. This template is our **parsing definition**.
127
128`reghex` functions only with its Babel plugin, which will detect `match('name')`
129and replace the entire tag with a parsing function, which may then look like
130the following in your transpiled code:
131
132```js
133import { _pattern /* ... */ } from 'reghex';
134
135var _name_expression = _pattern(/\w+/);
136var name = function name() {
137  /* ... */
138};
139```
140
141We've now successfully created a matcher, which matches a single regex, which
142is a pattern of one or more letters. We can execute this matcher by calling
143it with the curried `parse` utility:
144
145```js
146import { parse } from 'reghex';
147
148const result = parse(name)('Tim');
149
150console.log(result); // [ "Tim", .tag = "name" ]
151console.log(result.tag); // "name"
152```
153
154If the string (Here: "Tim") was parsed successfully by the matcher, it will
155return an array that contains the result of the regex. The array is special
156in that it will also have a `tag` property set to the matcher's name, here
157`"name"`, which we determined when we defined the matcher as `match('name')`.
158
159```js
160import { parse } from 'reghex';
161parse(name)('42'); // undefined
162```
163
164Similarly, if the matcher does not parse an input string successfully, it will
165return `undefined` instead.
166
167### Nested matchers
168
169This on its own is nice, but a parser must be able to traverse a string and
170turn it into an [Abstract Syntax Tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree).
171To introduce nesting to `reghex` matchers, we can refer to one matcher in another!
172Let's extend our original example;
173
174```js
175import { match } from 'reghex';
176
177const name = match('name')`
178  ${/\w+/}
179`;
180
181const hello = match('hello')`
182  ${/hello /} ${name}
183`;
184```
185
186The new `hello` matcher is set to match `/hello /` and then attempts to match
187the `name` matcher afterwards. If either of these matchers fail, it will return
188`undefined` as well and roll back its changes. Using this matcher will give us
189**nested abstract output**.
190
191We can also see in this example that _outside_ of the regex interpolations,
192whitespace and newlines don't matter.
193
194```js
195import { parse } from 'reghex';
196
197parse(hello)('hello tim');
198/*
199  [
200    "hello",
201    ["tim", .tag = "name"],
202    .tag = "hello"
203  ]
204*/
205```
206
207Furthermore, interpolations don't have to just be RegHex matchers. They can
208also be functions returning matchers or completely custom matching functions.
209This is useful when your DSL becomes _self-referential_, i.e. when one matchers
210start referencing each other forming a loop. To fix this we can create a
211function that returns our root matcher:
212
213```js
214import { match } from 'reghex';
215
216const value = match('value')`
217  (${/\w+/} | ${() => root})+
218`;
219
220const root = match('root')`
221  ${/root/}+ ${value}
222`;
223```
224
225### Regex-like DSL
226
227We've seen in the previous examples that matchers are authored using tagged
228template literals, where interpolations can either be filled using regexes,
229`${/pattern/}`, or with other matchers `${name}`.
230
231The tagged template syntax supports more ways to match these interpolations,
232using a regex-like Domain Specific Language. Unlike in regexes, whitespace
233and newlines don't matter, which makes it easier to format and read matchers.
234
235We can create **sequences** of matchers by adding multiple expressions in
236a row. A matcher using `${/1/} ${/2/}` will attempt to match `1` and then `2`
237in the parsed string. This is just one feature of the regex-like DSL. The
238available operators are the following:
239
240| Operator | Example            | Description                                                                                                                                                                              |
241| -------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
242| `?`      | `${/1/}?`          | An **optional** may be used to make an interpolation optional. This means that the interpolation may or may not match.                                                                   |
243| `*`      | `${/1/}*`          | A **star** can be used to match an arbitrary amount of interpolation or none at all. This means that the interpolation may repeat itself or may not be matched at all.                   |
244| `+`      | `${/1/}+`          | A **plus** is used like `*` and must match one or more times. When the matcher doesn't match, that's considered a failing case, since the match isn't optional.                          |
245| `\|`     | `${/1/} \| ${/2/}` | An **alternation** can be used to match either one thing or another, falling back when the first interpolation fails.                                                                    |
246| `()`     | `(${/1/} ${/2/})+` | A **group** can be used to apply one of the other operators to an entire group of interpolations.                                                                                        |
247| `(?: )`  | `(?: ${/1/})`      | A **non-capturing group** is like a regular group, but the interpolations matched inside it don't appear in the parser's output.                                                         |
248| `(?= )`  | `(?= ${/1/})`      | A **positive lookahead** checks whether interpolations match, and if so continues the matcher without changing the input. If it matches, it's essentially ignored.                       |
249| `(?! )`  | `(?! ${/1/})`      | A **negative lookahead** checks whether interpolations _don't_ match, and if so continues the matcher without changing the input. If the interpolations do match the matcher is aborted. |
250
251A couple of operators also support "short hands" that allow you to write
252lookaheads or non-capturing groups a little quicker.
253
254| Shorthand | Example   | Description                                                                                                                                                                              |
255| --------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
256| `:`       | `:${/1/}` | A **non-capturing group** is like a regular group, but the interpolations matched inside it don't appear in the parser's output.                                                         |
257| `=`       | `=${/1/}` | A **positive lookahead** checks whether interpolations match, and if so continues the matcher without changing the input. If it matches, it's essentially ignored.                       |
258| `!`       | `!${/1/}` | A **negative lookahead** checks whether interpolations _don't_ match, and if so continues the matcher without changing the input. If the interpolations do match the matcher is aborted. |
259
260We can combine and compose these operators to create more complex matchers.
261For instance, we can extend the original example to only allow a specific set
262of names by using the `|` operator:
263
264```js
265const name = match('name')`
266  ${/tim/} | ${/tom/} | ${/tam/}
267`;
268
269parse(name)('tim'); // [ "tim", .tag = "name" ]
270parse(name)('tom'); // [ "tom", .tag = "name" ]
271parse(name)('patrick'); // undefined
272```
273
274The above will now only match specific name strings. When one pattern in this
275chain of **alternations** does not match, it will try the next one.
276
277We can also use **groups** to add more matchers around the alternations themselves,
278by surrounding the alternations with `(` and `)`
279
280```js
281const name = match('name')`
282  (${/tim/} | ${/tom/}) ${/!/}
283`;
284
285parse(name)('tim!'); // [ "tim", "!", .tag = "name" ]
286parse(name)('tom!'); // [ "tom", "!", .tag = "name" ]
287parse(name)('tim'); // undefined
288```
289
290Maybe we're also not that interested in the `"!"` showing up in the output node.
291If we want to get rid of it, we can use a **non-capturing group** to hide it,
292while still requiring it.
293
294```js
295const name = match('name')`
296  (${/tim/} | ${/tom/}) (?: ${/!/})
297`;
298
299parse(name)('tim!'); // [ "tim", .tag = "name" ]
300parse(name)('tim'); // undefined
301```
302
303Lastly, like with regexes, `?`, `*`, and `+` may be used as "quantifiers". The first two
304may also be optional and _not_ match their patterns without the matcher failing.
305The `+` operator is used to match an interpolation _one or more_ times, while the
306`*` operators may match _zero or more_ times. Let's use this to allow the `"!"`
307to repeat.
308
309```js
310const name = match('name')`
311  (${/tim/} | ${/tom/})+ (?: ${/!/})*
312`;
313
314parse(name)('tim!'); // [ "tim", .tag = "name" ]
315parse(name)('tim!!!!'); // [ "tim", .tag = "name" ]
316parse(name)('tim'); // [ "tim", .tag = "name" ]
317parse(name)('timtim'); // [ "tim", tim", .tag = "name" ]
318```
319
320As we can see from the above, like in regexes, quantifiers can be combined with groups,
321non-capturing groups, or other groups.
322
323### Transforming as we match
324
325In the previous sections, we've seen that the **nodes** that `reghex` outputs are arrays containing
326match strings or other nodes and have a special `tag` property with the node's type.
327We can **change this output** while we're parsing by passing a function to our matcher definition.
328
329```js
330const name = match('name', (x) => x[0])`
331  (${/tim/} | ${/tom/}) ${/!/}
332`;
333
334parse(name)('tim'); // "tim"
335```
336
337In the above example, we're passing a small function, `x => x[0]` to the matcher as a
338second argument. This will change the matcher's output, which causes the parser to
339now return a new output for this matcher.
340
341We can use this function creatively by outputting full AST nodes, maybe even like the
342ones that resemble Babel's output:
343
344```js
345const identifier = match('identifier', (x) => ({
346  type: 'Identifier',
347  name: x[0],
348}))`
349  ${/[\w_][\w\d_]+/}
350`;
351
352parse(name)('var_name'); // { type: "Identifier", name: "var_name" }
353```
354
355We've now entirely changed the output of the parser for this matcher. Given that each
356matcher can change its output, we're free to change the parser's output entirely.
357By returning `null` or `undefined` in this matcher, we can also change the matcher
358to not have matched, which would cause other matchers to treat it like a mismatch!
359
360```js
361import { match, parse } from 'reghex';
362
363const name = match('name')((x) => {
364  return x[0] !== 'tim' ? x : undefined;
365})`
366  ${/\w+/}
367`;
368
369const hello = match('hello')`
370  ${/hello /} ${name}
371`;
372
373parse(name)('tom'); // ["hello", ["tom", .tag = "name"], .tag = "hello"]
374parse(name)('tim'); // undefined
375```
376
377Lastly, if we need to create these special array nodes ourselves, we can use `reghex`'s
378`tag` export for this purpose.
379
380```js
381import { tag } from 'reghex';
382
383tag(['test'], 'node_name');
384// ["test", .tag = "node_name"]
385```
386
387### Tagged Template Parsing
388
389Any grammar in RegHex can also be used to parse a tagged template literal.
390A tagged template literal consists of a list of literals alternating with
391a list of "interpolations".
392
393In RegHex we can add an `interpolation` matcher to our grammars to allow it
394to parse interpolations in a template literal.
395
396```js
397import { interpolation } from 'reghex';
398
399const anyNumber = interpolation((x) => typeof x === 'number');
400
401const num = match('num')`
402  ${/[+-]?/} ${anyNumber}
403`;
404
405parse(num)`+${42}`;
406// ["+", 42, .tag = "num"]
407```
408
409This grammar now allows us to match arbitrary values if they're input into the
410parser. We can now call our grammar using a tagged template literal themselves
411to parse this.
412
413**That's it! May the RegExp be ever in your favor.**