Mirror: The magical sticky regex-based parser generator 馃
1<div align="center"> 2 <img alt="reghex" width="250" src="docs/reghex-logo.png" /> 3 <br /> 4 <br /> 5 <strong> 6 The magical sticky regex-based parser generator 7 </strong> 8 <br /> 9 <br /> 10 <br /> 11</div> 12 13Leveraging the power of sticky regexes and JS code generation, `reghex` allows 14you to code parsers quickly, by surrounding regular expressions with a regex-like 15[DSL](https://en.wikipedia.org/wiki/Domain-specific_language). 16 17With `reghex` you can generate a parser from a tagged template literal, which is 18quick to prototype and generates reasonably compact and performant code. 19 20_This project is still in its early stages and is experimental. Its API may still 21change and some issues may need to be ironed out._ 22 23## Quick Start 24 25##### 1. Install with yarn or npm 26 27```sh 28yarn add reghex 29# or 30npm install --save reghex 31``` 32 33##### 2. Add the plugin to your Babel configuration _(optional)_ 34 35In your `.babelrc`, `babel.config.js`, or `package.json:babel` add: 36 37```json 38{ 39 "plugins": ["reghex/babel"] 40} 41``` 42 43Alternatively, you can set up [`babel-plugin-macros`](https://github.com/kentcdodds/babel-plugin-macros) and 44import `reghex` from `"reghex/macro"` instead. 45 46This step is **optional**. `reghex` can also generate its optimised JS code during runtime. 47This will only incur a tiny parsing cost on initialisation, but due to the JIT of modern 48JS engines there won't be any difference in performance between pre-compiled and compiled 49versions otherwise. 50 51Since the `reghex` runtime is rather small, for larger grammars it may even make sense not 52to precompile the matchers at all. For this case you may pass the `{ "codegen": false }` 53option to the Babel plugin, which will minify the `reghex` matcher templates without 54precompiling them. 55 56##### 3. Have fun writing parsers! 57 58```js 59import { match, parse } from 'reghex'; 60 61const name = match('name')` 62 ${/\w+/} 63`; 64 65parse(name)('hello'); 66// [ "hello", .tag = "name" ] 67``` 68 69## Concepts 70 71The fundamental concept of `reghex` are regexes, specifically 72[sticky regexes](https://www.loganfranken.com/blog/831/es6-everyday-sticky-regex-matches/)! 73These are regular expressions that don't search a target string, but instead match at the 74specific position they're at. The flag for sticky regexes is `y` and hence 75they can be created using `/phrase/y` or `new RegExp('phrase', 'y')`. 76 77**Sticky Regexes** are the perfect foundation for a parsing framework in JavaScript! 78Because they only match at a single position they can be used to match patterns 79continuously, as a parser would. Like global regexes, we can then manipulate where 80they should be matched by setting `regex.lastIndex = index;` and after matching 81read back their updated `regex.lastIndex`. 82 83> **Note:** Sticky Regexes aren't natively 84> [supported in any versions of Internet Explorer](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/sticky#Browser_compatibility). `reghex` works around this by imitating its behaviour, which may decrease performance on IE11. 85 86This primitive allows us to build up a parser from regexes that you pass when 87authoring a parser function, also called a "matcher" in `reghex`. When `reghex` compiles 88to parser code, this code is just a sequence and combination of sticky regexes that 89are executed in order! 90 91```js 92let input = 'phrases should be parsed...'; 93let lastIndex = 0; 94 95const regex = /phrase/y; 96function matcher() { 97 let match; 98 // Before matching we set the current index on the RegExp 99 regex.lastIndex = lastIndex; 100 // Then we match and store the result 101 if ((match = regex.exec(input))) { 102 // If the RegExp matches successfully, we update our lastIndex 103 lastIndex = regex.lastIndex; 104 } 105} 106``` 107 108This mechanism is used in all matcher functions that `reghex` generates. 109Internally `reghex` keeps track of the input string and the current index on 110that string, and the matcher functions execute regexes against this state. 111 112## Authoring Guide 113 114You can write "matchers" by importing the `match` import from `reghex` and 115using it to write a matcher expression. 116 117```js 118import { match } from 'reghex'; 119 120const name = match('name')` 121 ${/\w+/} 122`; 123``` 124 125As can be seen above, the `match` function, is called with a "node name" and 126is then called as a tagged template. This template is our **parsing definition**. 127 128`reghex` functions only with its Babel plugin, which will detect `match('name')` 129and replace the entire tag with a parsing function, which may then look like 130the following in your transpiled code: 131 132```js 133import { _pattern /* ... */ } from 'reghex'; 134 135var _name_expression = _pattern(/\w+/); 136var name = function name() { 137 /* ... */ 138}; 139``` 140 141We've now successfully created a matcher, which matches a single regex, which 142is a pattern of one or more letters. We can execute this matcher by calling 143it with the curried `parse` utility: 144 145```js 146import { parse } from 'reghex'; 147 148const result = parse(name)('Tim'); 149 150console.log(result); // [ "Tim", .tag = "name" ] 151console.log(result.tag); // "name" 152``` 153 154If the string (Here: "Tim") was parsed successfully by the matcher, it will 155return an array that contains the result of the regex. The array is special 156in that it will also have a `tag` property set to the matcher's name, here 157`"name"`, which we determined when we defined the matcher as `match('name')`. 158 159```js 160import { parse } from 'reghex'; 161parse(name)('42'); // undefined 162``` 163 164Similarly, if the matcher does not parse an input string successfully, it will 165return `undefined` instead. 166 167### Nested matchers 168 169This on its own is nice, but a parser must be able to traverse a string and 170turn it into an [Abstract Syntax Tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree). 171To introduce nesting to `reghex` matchers, we can refer to one matcher in another! 172Let's extend our original example; 173 174```js 175import { match } from 'reghex'; 176 177const name = match('name')` 178 ${/\w+/} 179`; 180 181const hello = match('hello')` 182 ${/hello /} ${name} 183`; 184``` 185 186The new `hello` matcher is set to match `/hello /` and then attempts to match 187the `name` matcher afterwards. If either of these matchers fail, it will return 188`undefined` as well and roll back its changes. Using this matcher will give us 189**nested abstract output**. 190 191We can also see in this example that _outside_ of the regex interpolations, 192whitespace and newlines don't matter. 193 194```js 195import { parse } from 'reghex'; 196 197parse(hello)('hello tim'); 198/* 199 [ 200 "hello", 201 ["tim", .tag = "name"], 202 .tag = "hello" 203 ] 204*/ 205``` 206 207Furthermore, interpolations don't have to just be RegHex matchers. They can 208also be functions returning matchers or completely custom matching functions. 209This is useful when your DSL becomes _self-referential_, i.e. when one matchers 210start referencing each other forming a loop. To fix this we can create a 211function that returns our root matcher: 212 213```js 214import { match } from 'reghex'; 215 216const value = match('value')` 217 (${/\w+/} | ${() => root})+ 218`; 219 220const root = match('root')` 221 ${/root/}+ ${value} 222`; 223``` 224 225### Regex-like DSL 226 227We've seen in the previous examples that matchers are authored using tagged 228template literals, where interpolations can either be filled using regexes, 229`${/pattern/}`, or with other matchers `${name}`. 230 231The tagged template syntax supports more ways to match these interpolations, 232using a regex-like Domain Specific Language. Unlike in regexes, whitespace 233and newlines don't matter, which makes it easier to format and read matchers. 234 235We can create **sequences** of matchers by adding multiple expressions in 236a row. A matcher using `${/1/} ${/2/}` will attempt to match `1` and then `2` 237in the parsed string. This is just one feature of the regex-like DSL. The 238available operators are the following: 239 240| Operator | Example | Description | 241| -------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 242| `?` | `${/1/}?` | An **optional** may be used to make an interpolation optional. This means that the interpolation may or may not match. | 243| `*` | `${/1/}*` | A **star** can be used to match an arbitrary amount of interpolation or none at all. This means that the interpolation may repeat itself or may not be matched at all. | 244| `+` | `${/1/}+` | A **plus** is used like `*` and must match one or more times. When the matcher doesn't match, that's considered a failing case, since the match isn't optional. | 245| `\|` | `${/1/} \| ${/2/}` | An **alternation** can be used to match either one thing or another, falling back when the first interpolation fails. | 246| `()` | `(${/1/} ${/2/})+` | A **group** can be used to apply one of the other operators to an entire group of interpolations. | 247| `(?: )` | `(?: ${/1/})` | A **non-capturing group** is like a regular group, but the interpolations matched inside it don't appear in the parser's output. | 248| `(?= )` | `(?= ${/1/})` | A **positive lookahead** checks whether interpolations match, and if so continues the matcher without changing the input. If it matches, it's essentially ignored. | 249| `(?! )` | `(?! ${/1/})` | A **negative lookahead** checks whether interpolations _don't_ match, and if so continues the matcher without changing the input. If the interpolations do match the matcher is aborted. | 250 251We can combine and compose these operators to create more complex matchers. 252For instance, we can extend the original example to only allow a specific set 253of names by using the `|` operator: 254 255```js 256const name = match('name')` 257 ${/tim/} | ${/tom/} | ${/tam/} 258`; 259 260parse(name)('tim'); // [ "tim", .tag = "name" ] 261parse(name)('tom'); // [ "tom", .tag = "name" ] 262parse(name)('patrick'); // undefined 263``` 264 265The above will now only match specific name strings. When one pattern in this 266chain of **alternations** does not match, it will try the next one. 267 268We can also use **groups** to add more matchers around the alternations themselves, 269by surrounding the alternations with `(` and `)` 270 271```js 272const name = match('name')` 273 (${/tim/} | ${/tom/}) ${/!/} 274`; 275 276parse(name)('tim!'); // [ "tim", "!", .tag = "name" ] 277parse(name)('tom!'); // [ "tom", "!", .tag = "name" ] 278parse(name)('tim'); // undefined 279``` 280 281Maybe we're also not that interested in the `"!"` showing up in the output node. 282If we want to get rid of it, we can use a **non-capturing group** to hide it, 283while still requiring it. 284 285```js 286const name = match('name')` 287 (${/tim/} | ${/tom/}) (?: ${/!/}) 288`; 289 290parse(name)('tim!'); // [ "tim", .tag = "name" ] 291parse(name)('tim'); // undefined 292``` 293 294Lastly, like with regexes, `?`, `*`, and `+` may be used as "quantifiers". The first two 295may also be optional and _not_ match their patterns without the matcher failing. 296The `+` operator is used to match an interpolation _one or more_ times, while the 297`*` operators may match _zero or more_ times. Let's use this to allow the `"!"` 298to repeat. 299 300```js 301const name = match('name')` 302 (${/tim/} | ${/tom/})+ (?: ${/!/})* 303`; 304 305parse(name)('tim!'); // [ "tim", .tag = "name" ] 306parse(name)('tim!!!!'); // [ "tim", .tag = "name" ] 307parse(name)('tim'); // [ "tim", .tag = "name" ] 308parse(name)('timtim'); // [ "tim", tim", .tag = "name" ] 309``` 310 311As we can see from the above, like in regexes, quantifiers can be combined with groups, 312non-capturing groups, or other groups. 313 314### Transforming as we match 315 316In the previous sections, we've seen that the **nodes** that `reghex` outputs are arrays containing 317match strings or other nodes and have a special `tag` property with the node's type. 318We can **change this output** while we're parsing by passing a function to our matcher definition. 319 320```js 321const name = match('name', (x) => x[0])` 322 (${/tim/} | ${/tom/}) ${/!/} 323`; 324 325parse(name)('tim'); // "tim" 326``` 327 328In the above example, we're passing a small function, `x => x[0]` to the matcher as a 329second argument. This will change the matcher's output, which causes the parser to 330now return a new output for this matcher. 331 332We can use this function creatively by outputting full AST nodes, maybe even like the 333ones that resemble Babel's output: 334 335```js 336const identifier = match('identifier', (x) => ({ 337 type: 'Identifier', 338 name: x[0], 339}))` 340 ${/[\w_][\w\d_]+/} 341`; 342 343parse(name)('var_name'); // { type: "Identifier", name: "var_name" } 344``` 345 346We've now entirely changed the output of the parser for this matcher. Given that each 347matcher can change its output, we're free to change the parser's output entirely. 348By **returning a falsy value** in this matcher, we can also change the matcher to not have 349matched, which would cause other matchers to treat it like a mismatch! 350 351```js 352import { match, parse } from 'reghex'; 353 354const name = match('name')((x) => { 355 return x[0] !== 'tim' ? x : undefined; 356})` 357 ${/\w+/} 358`; 359 360const hello = match('hello')` 361 ${/hello /} ${name} 362`; 363 364parse(name)('tom'); // ["hello", ["tom", .tag = "name"], .tag = "hello"] 365parse(name)('tim'); // undefined 366``` 367 368Lastly, if we need to create these special array nodes ourselves, we can use `reghex`'s 369`tag` export for this purpose. 370 371```js 372import { tag } from 'reghex'; 373 374tag(['test'], 'node_name'); 375// ["test", .tag = "node_name"] 376``` 377 378**That's it! May the RegExp be ever in your favor.**