README.md at v2.0.0-beta.2 · kitten.sh/reghex

<div align="center">
  <img alt="reghex" width="250" src="docs/reghex-logo.png" />
  <br />
  <br />
  <strong>
    The magical sticky regex-based parser generator
  </strong>
  <br />
  <br />
  <br />
</div>

Leveraging the power of sticky regexes and JS code generation, `reghex` allows
you to code parsers quickly, by surrounding regular expressions with a regex-like
[DSL](https://en.wikipedia.org/wiki/Domain-specific_language).

With `reghex` you can generate a parser from a tagged template literal, which is
quick to prototype and generates reasonably compact and performant code.

_This project is still in its early stages and is experimental. Its API may still
change and some issues may need to be ironed out._

## Quick Start

##### 1. Install with yarn or npm

```sh
yarn add reghex
# or
npm install --save reghex
```

##### 2. Add the plugin to your Babel configuration _(optional)_

In your `.babelrc`, `babel.config.js`, or `package.json:babel` add:

```json
{
  "plugins": ["reghex/babel"]
}
```

Alternatively, you can set up [`babel-plugin-macros`](https://github.com/kentcdodds/babel-plugin-macros) and
import `reghex` from `"reghex/macro"` instead.

This step is **optional**. `reghex` can also generate its optimised JS code purely at runtime!

##### 3. Have fun writing parsers!

```js
import { match, parse } from 'reghex';

const name = match('name')`
  ${/\w+/}
`;

parse(name)('hello');
// [ "hello", .tag = "name" ]
```

## Concepts

The fundamental building blocks of `reghex` are regexes, specifically
[sticky regexes](https://www.loganfranken.com/blog/831/es6-everyday-sticky-regex-matches/)!
These are regular expressions that don't search a target string, but instead match only at the
specific position they're set to. The flag for sticky regexes is `y`, and hence
they can be created using `/phrase/y` or `new RegExp('phrase', 'y')`.

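To make the difference concrete, here is a small sketch in plain JavaScript that doesn't use `reghex` at all. The input string and variable names are made up for illustration. It compares a regular (searching) regex with a sticky one on the same input:

```js
const input = 'a phrase';

// A regular regex searches the whole string for a match
const searching = /phrase/;
const found = searching.exec(input); // found.index is 2

// A sticky regex only attempts a match at exactly `lastIndex`
const sticky = /phrase/y;
sticky.lastIndex = 0;
const atStart = sticky.exec(input); // null, "phrase" doesn't start at index 0

sticky.lastIndex = 2;
const atTwo = sticky.exec(input); // matches "phrase", which starts at index 2
```
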
**Sticky Regexes** are the perfect foundation for a parsing framework in JavaScript!
Because they only match at a single position they can be used to match patterns
continuously, as a parser would. Like with global regexes, we can control where
they match by setting `regex.lastIndex = index;` and, after matching,
read back their updated `regex.lastIndex`.

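As a rough illustration of matching "continuously", the following plain JavaScript sketch walks over a string with two sticky regexes and hands the position from one to the other via `lastIndex`. This isn't `reghex` code; the `word` and `space` regexes and the loop are assumptions made for the example.

```js
const input = 'one two';
const word = /\w+/y;
const space = /\s+/y;

const words = [];
let index = 0;

while (index < input.length) {
  // Try to match a word at the current position
  word.lastIndex = index;
  const match = word.exec(input);
  if (match) {
    words.push(match[0]);
    index = word.lastIndex;
    continue;
  }
  // Otherwise try to skip whitespace at the current position
  space.lastIndex = index;
  if (space.exec(input)) {
    index = space.lastIndex;
    continue;
  }
  break; // Nothing matches at this position, so we stop
}
// words is ["one", "two"] and index is 7
```
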
> **Note:** Sticky Regexes aren't natively
> [supported in any version of Internet Explorer](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/sticky#Browser_compatibility). `reghex` works around this by imitating their behaviour, which may decrease performance on IE11.

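One common way to imitate a sticky regex is to use a global regex and reject any match that doesn't start at the requested index. The sketch below shows that general technique; `execAt` is a hypothetical helper and this is not necessarily how `reghex` implements its fallback. It's written in modern syntax for readability. Because a global regex may scan ahead through the rest of the input before giving up, this approach can be slower than a native sticky match.

```js
// Hypothetical helper, not part of reghex's API
function execAt(regex, input, index) {
  // Copy the regex with the global flag so that `lastIndex` is respected
  const flags = 'g' + (regex.ignoreCase ? 'i' : '') + (regex.multiline ? 'm' : '');
  const searching = new RegExp(regex.source, flags);
  searching.lastIndex = index;
  const match = searching.exec(input);
  // A global regex searches ahead, so reject any match that starts later
  return match && match.index === index ? match : null;
}

const hit = execAt(/phrase/, 'a phrase', 2); // matches "phrase" at index 2
const miss = execAt(/phrase/, 'a phrase', 0); // null, since the match starts at index 2
```
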
This primitive allows us to build up a parser from the regexes that we pass when
authoring a parser function, also called a "matcher" in `reghex`. When `reghex` compiles
to parser code, this code is just a sequence and combination of sticky regexes that
are executed in order!

```js
let input = 'phrases should be parsed...';
let lastIndex = 0;

const regex = /phrase/y;
function matcher() {
  let match;
  // Before matching we set the current index on the RegExp
  regex.lastIndex = lastIndex;
  // Then we match and store the result
  if ((match = regex.exec(input))) {
    // If the RegExp matches successfully, we update our lastIndex
    lastIndex = regex.lastIndex;
  }
}
```

This mechanism is used in all matcher functions that `reghex` generates.
Internally `reghex` keeps track of the input string and the current index on
that string, and the matcher functions execute regexes against this state.

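To picture what "executing regexes against this state" could look like, here's a minimal sketch. The `{ input, index }` state shape and the `pattern` helper are assumptions for illustration and are not taken from `reghex`'s actual source.

```js
// Hypothetical helper: wraps a regex in a function that matches against shared state
const pattern = (regex) => {
  const sticky = new RegExp(regex.source, 'y');
  return (state) => {
    sticky.lastIndex = state.index;
    const match = sticky.exec(state.input);
    if (!match) return undefined; // On failure the state is left untouched
    state.index = sticky.lastIndex; // On success we advance the shared index
    return match[0];
  };
};

const state = { input: 'hello world', index: 0 };
const hello = pattern(/hello /);
const word = pattern(/\w+/);

const first = hello(state); // "hello ", state.index is now 6
const second = word(state); // "world", state.index is now 11
const third = word(state); // undefined, state.index stays at 11
```
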
## Authoring Guide

You can write "matchers" by importing `match` from `reghex` and
using it to write a matcher expression.

```js
import { match } from 'reghex';

const name = match('name')`
  ${/\w+/}
`;
```

As can be seen above, the `match` function is called with a "node name" and
is then called as a tagged template. This template is our **parsing definition**.

When its Babel plugin is enabled, `reghex` will detect `match('name')`
and replace the entire tag with a parsing function, which may then look like
the following in your transpiled code:

```js
import { _pattern /* ... */ } from 'reghex';

var _name_expression = _pattern(/\w+/);
var name = function name() {
  /* ... */
};
```

We've now successfully created a matcher, which matches a single regex, which
is a pattern of one or more word characters. We can execute this matcher by calling
it with the curried `parse` utility:

```js
import { parse } from 'reghex';

const result = parse(name)('Tim');

console.log(result); // [ "Tim", .tag = "name" ]
console.log(result.tag); // "name"
```

If the string (here: "Tim") was parsed successfully by the matcher, it will
return an array that contains the result of the regex. The array is special
in that it will also have a `tag` property set to the matcher's name, here
`"name"`, which we determined when we defined the matcher as `match('name')`.

```js
import { parse } from 'reghex';
parse(name)('!?'); // undefined
```

Conversely, if the matcher does not parse an input string successfully, it will
return `undefined` instead.

### Nested matchers

This on its own is nice, but a parser must be able to traverse a string and
turn it into an [Abstract Syntax Tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree).
To introduce nesting to `reghex` matchers, we can refer to one matcher in another!
Let's extend our original example:

```js
import { match } from 'reghex';

const name = match('name')`
  ${/\w+/}
`;

const hello = match('hello')`
  ${/hello /} ${name}
`;
```

The new `hello` matcher is set to match `/hello /` and then attempts to match
the `name` matcher afterwards. If either of these fails, `hello` will return
`undefined` as well and roll back its changes. Using this matcher will give us
**nested abstract output**.

We can also see in this example that _outside_ of the regex interpolations,
whitespace and newlines don't matter.

```js
import { parse } from 'reghex';

parse(hello)('hello tim');
/*
  [
    "hello ",
    ["tim", .tag = "name"],
    .tag = "hello"
  ]
*/
```

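The "roll back" behaviour of a sequence can be sketched by hand. The `pattern` and `sequence` helpers and the `{ input, index }` state below are hypothetical and only illustrate the idea; they are not `reghex`'s generated code.

```js
// Hypothetical helper: a sticky regex matcher operating on shared state
const pattern = (regex) => {
  const sticky = new RegExp(regex.source, 'y');
  return (state) => {
    sticky.lastIndex = state.index;
    const match = sticky.exec(state.input);
    if (!match) return undefined;
    state.index = sticky.lastIndex;
    return match[0];
  };
};

// Hypothetical combinator: runs matchers in order and rolls back on failure
const sequence = (tag, ...matchers) => (state) => {
  const start = state.index;
  const node = [];
  for (const matcher of matchers) {
    const result = matcher(state);
    if (result === undefined) {
      state.index = start; // Restore the index we started at
      return undefined;
    }
    node.push(result);
  }
  node.tag = tag;
  return node;
};

const name = sequence('name', pattern(/\w+/));
const hello = sequence('hello', pattern(/hello /), name);

const ok = { input: 'hello tim', index: 0 };
const node = hello(ok); // nested node, ok.index is now 9

const bad = { input: 'hello !', index: 0 };
const failed = hello(bad); // undefined, bad.index is back at 0
```
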
Furthermore, interpolations don't have to just be RegHex matchers. They can
also be functions returning matchers or completely custom matching functions.
This is useful when your DSL becomes _self-referential_, i.e. when matchers
start referencing each other, forming a loop. Since a matcher can't directly
reference one that is only declared later, we can instead pass a function that
returns our root matcher:

```js
import { match } from 'reghex';

const value = match('value')`
  (${/\w+/} | ${() => root})+
`;

const root = match('root')`
  ${/root/}+ ${value}
`;
```

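The reason a function helps here is plain JavaScript scoping: a `const` can't be read before its declaration has run, but a function that mentions it is fine as long as it's only called later. The sketch below demonstrates this without `reghex`; `makeLazy` and the string standing in for a matcher are made up for the example.

```js
// Hypothetical stand-in: defers reading the referenced value until it's called
const makeLazy = (thunk) => () => thunk();

const value = makeLazy(() => root); // Fine: `root` is only read when `value` is called
const root = 'root matcher';

const resolved = value(); // "root matcher"

// Reading a `const` directly before its declaration throws instead
let error;
try {
  (() => {
    const early = [later]; // `later` is still in its temporal dead zone here
    const later = 'too late';
    return early;
  })();
} catch (e) {
  error = e; // a ReferenceError
}
```
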
### Regex-like DSL

We've seen in the previous examples that matchers are authored using tagged
template literals, where interpolations can either be filled using regexes,
`${/pattern/}`, or with other matchers `${name}`.

The tagged template syntax supports more ways to match these interpolations,
using a regex-like Domain Specific Language. Unlike in regexes, whitespace
and newlines don't matter, which makes it easier to format and read matchers.

We can create **sequences** of matchers by adding multiple expressions in
a row. A matcher using `${/1/} ${/2/}` will attempt to match `1` and then `2`
in the parsed string. This is just one feature of the regex-like DSL. The
available operators are the following:

| Operator | Example            | Description                                                                                                                                                                              |
| -------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `?`      | `${/1/}?`          | An **optional** may be used to make an interpolation optional. This means that the interpolation may or may not match.                                                                   |
| `*`      | `${/1/}*`          | A **star** can be used to match an interpolation any number of times or not at all. This means that the interpolation may repeat itself or may not be matched at all.                    |
| `+`      | `${/1/}+`          | A **plus** is used like `*` and must match one or more times. When the matcher doesn't match, that's considered a failing case, since the match isn't optional.                          |
| `\|`     | `${/1/} \| ${/2/}` | An **alternation** can be used to match either one thing or another, falling back when the first interpolation fails.                                                                    |
| `()`     | `(${/1/} ${/2/})+` | A **group** can be used to apply one of the other operators to an entire group of interpolations.                                                                                        |
| `(?: )`  | `(?: ${/1/})`      | A **non-capturing group** is like a regular group, but the interpolations matched inside it don't appear in the parser's output.                                                         |
| `(?= )`  | `(?= ${/1/})`      | A **positive lookahead** checks whether interpolations match, and if so continues the matcher without changing the input. If it matches, it's essentially ignored.                       |
| `(?! )`  | `(?! ${/1/})`      | A **negative lookahead** checks whether interpolations _don't_ match, and if so continues the matcher without changing the input. If the interpolations do match the matcher is aborted. |

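The two lookaheads from the table can be thought of as a sticky test that never advances the position. Here's a plain JavaScript sketch of that idea; `matchesAt` is a hypothetical helper and not part of `reghex`.

```js
// Hypothetical helper: reports whether `regex` matches at `index` without consuming input
const matchesAt = (regex, input, index) => {
  const sticky = new RegExp(regex.source, 'y');
  sticky.lastIndex = index;
  return sticky.test(input);
};

const input = 'tim!';
const index = 3; // The position of "!"

// Like (?= ${/!/}): only continue if "!" comes next
const positive = matchesAt(/!/, input, index); // true

// Like (?! ${/!/}): only continue if "!" does not come next
const negative = !matchesAt(/!/, input, index); // false

// In both cases the position we'd continue matching from is still `index`
```
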
We can combine and compose these operators to create more complex matchers.
For instance, we can extend the original example to only allow a specific set
of names by using the `|` operator:

```js
const name = match('name')`
  ${/tim/} | ${/tom/} | ${/tam/}
`;

parse(name)('tim'); // [ "tim", .tag = "name" ]
parse(name)('tom'); // [ "tom", .tag = "name" ]
parse(name)('patrick'); // undefined
```

The above will now only match specific name strings. When one pattern in this
chain of **alternations** does not match, it will try the next one.

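Written out by hand, "trying the next one" amounts to running each sticky regex at the same position until one succeeds. The following is only a sketch of that idea in plain JavaScript; `matchName` is a made-up function, not something `reghex` exports or generates verbatim.

```js
const alternatives = [/tim/y, /tom/y, /tam/y];

// Hypothetical matcher: tries every alternative at the same index, first success wins
const matchName = (input, index) => {
  for (const regex of alternatives) {
    regex.lastIndex = index;
    const match = regex.exec(input);
    if (match) return match[0];
  }
  return undefined; // No alternative matched at this position
};

const tom = matchName('tom', 0); // "tom"
const patrick = matchName('patrick', 0); // undefined
```
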
We can also use **groups** to add more matchers around the alternations themselves,
by surrounding the alternations with `(` and `)`:

```js
const name = match('name')`
  (${/tim/} | ${/tom/}) ${/!/}
`;

parse(name)('tim!'); // [ "tim", "!", .tag = "name" ]
parse(name)('tom!'); // [ "tom", "!", .tag = "name" ]
parse(name)('tim'); // undefined
```

Maybe we're also not that interested in the `"!"` showing up in the output node.
If we want to get rid of it, we can use a **non-capturing group** to hide it,
while still requiring it.

```js
const name = match('name')`
  (${/tim/} | ${/tom/}) (?: ${/!/})
`;

parse(name)('tim!'); // [ "tim", .tag = "name" ]
parse(name)('tim'); // undefined
```

Lastly, like with regexes, `?`, `*`, and `+` may be used as "quantifiers". The first two
are optional and may _not_ match their patterns without the matcher failing.
The `+` operator is used to match an interpolation _one or more_ times, while the
`*` operator may match _zero or more_ times. Let's use this to allow the `"!"`
to repeat.

```js
const name = match('name')`
  (${/tim/} | ${/tom/})+ (?: ${/!/})*
`;

parse(name)('tim!'); // [ "tim", .tag = "name" ]
parse(name)('tim!!!!'); // [ "tim", .tag = "name" ]
parse(name)('tim'); // [ "tim", .tag = "name" ]
parse(name)('timtim'); // [ "tim", "tim", .tag = "name" ]
```

As we can see from the above, like in regexes, quantifiers can be applied to groups,
non-capturing groups, or single interpolations.

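The difference between `*` and `+` comes down to how many repetitions are required. As a rough plain JavaScript sketch, a repetition is a loop over a sticky regex with a minimum count. The `repeat` helper and its return shape are assumptions for this example and don't mirror `reghex`'s internals.

```js
// Hypothetical helper: matches `regex` repeatedly from `index`, at least `min` times
const repeat = (regex, min) => (input, index) => {
  const sticky = new RegExp(regex.source, 'y');
  const matches = [];
  let end = index;
  sticky.lastIndex = end;
  let match;
  // Stop on the first failure; the length check guards against empty matches
  while ((match = sticky.exec(input)) && match[0].length > 0) {
    matches.push(match[0]);
    end = sticky.lastIndex;
  }
  return matches.length >= min ? { matches, end } : undefined;
};

const star = repeat(/!/, 0); // Like `*`: zero or more
const plus = repeat(/tim/, 1); // Like `+`: one or more

const names = plus('timtim!', 0); // { matches: ["tim", "tim"], end: 6 }
const marks = star('timtim!', 6); // { matches: ["!"], end: 7 }
const none = star('tim', 3); // { matches: [], end: 3 }
const fail = plus('tom', 0); // undefined
```
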
### Transforming as we match

In the previous sections, we've seen that the **nodes** that `reghex` outputs are arrays containing
match strings or other nodes and have a special `tag` property with the node's type.
We can **change this output** while we're parsing by passing a function to our matcher definition.

```js
const name = match('name', (x) => x[0])`
  (${/tim/} | ${/tom/}) ${/!/}
`;

parse(name)('tim!'); // "tim"
```

In the above example, we're passing a small function, `(x) => x[0]`, to the matcher as a
second argument. The function receives the matcher's node, and whatever it returns
becomes the matcher's output instead.

We can use this function creatively by outputting full AST nodes, maybe even
ones that resemble Babel's output:

```js
const identifier = match('identifier', (x) => ({
  type: 'Identifier',
  name: x[0],
}))`
  ${/[\w_][\w\d_]+/}
`;

parse(identifier)('var_name'); // { type: "Identifier", name: "var_name" }
```

We've now entirely changed the output of the parser for this matcher. Given that each
matcher can change its output, we're free to change the parser's output entirely.
By **returning a falsy value** from this function, we can also make the matcher count as not
having matched, which causes other matchers to treat it like a mismatch!

```js
import { match, parse } from 'reghex';

const name = match('name', (x) => {
  return x[0] !== 'tim' ? x : undefined;
})`
  ${/\w+/}
`;

const hello = match('hello')`
  ${/hello /} ${name}
`;

parse(hello)('hello tom'); // ["hello ", ["tom", .tag = "name"], .tag = "hello"]
parse(hello)('hello tim'); // undefined
```

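Conceptually, a transform is a function applied to a matcher's node before it's returned, with a falsy result being treated as a mismatch. Here's a hand-written sketch of that idea. `withTransform`, the `word` matcher and the `{ input, index }` state are hypothetical, and resetting the index on a falsy result is an assumption rather than documented `reghex` behaviour.

```js
// Hypothetical base matcher producing a tagged array node
const word = (state) => {
  const sticky = /\w+/y;
  sticky.lastIndex = state.index;
  const match = sticky.exec(state.input);
  if (!match) return undefined;
  state.index = sticky.lastIndex;
  const node = [match[0]];
  node.tag = 'name';
  return node;
};

// Hypothetical wrapper: applies `transform` and treats falsy output as a mismatch
const withTransform = (matcher, transform) => (state) => {
  const start = state.index;
  const node = matcher(state);
  if (node === undefined) return undefined;
  const output = transform(node);
  if (!output) {
    state.index = start; // Assumption: a rejected match consumes no input
    return undefined;
  }
  return output;
};

const identifier = withTransform(word, (x) => ({ type: 'Identifier', name: x[0] }));
const notTim = withTransform(word, (x) => (x[0] !== 'tim' ? x : undefined));

const ast = identifier({ input: 'var_name', index: 0 }); // { type: "Identifier", name: "var_name" }
const tom = notTim({ input: 'tom', index: 0 }); // ["tom"] with tag "name"
const timState = { input: 'tim', index: 0 };
const tim = notTim(timState); // undefined, timState.index is still 0
```
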
Lastly, if we need to create these special array nodes ourselves, we can use `reghex`'s
`tag` export for this purpose.

```js
import { tag } from 'reghex';

tag(['test'], 'node_name');
// ["test", .tag = "node_name"]
```

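The `.tag = "..."` notation used in the output comments throughout this document simply means "an array with an extra `tag` property". The sketch below shows what such a node looks like in plain JavaScript. `tagNode` is a hypothetical stand-in written for illustration; in real code you'd use `reghex`'s own `tag` export.

```js
// Hypothetical stand-in for illustration only
const tagNode = (array, name) => {
  array.tag = name;
  return array;
};

const node = tagNode(['test'], 'node_name');

const isArray = Array.isArray(node); // true, it's still a regular array
const nodeTag = node.tag; // "node_name"
const serialized = JSON.stringify(node); // '["test"]', the extra property isn't serialized
```
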
**That's it! May the RegExp be ever in your favor.**