Mirror: The magical sticky regex-based parser generator 馃
1<div align="center"> 2 <img alt="reghex" width="250" src="docs/reghex-logo.png" /> 3 <br /> 4 <br /> 5 <strong> 6 The magical sticky regex-based parser generator 7 </strong> 8 <br /> 9 <br /> 10 <br /> 11</div> 12 13Leveraging the power of sticky regexes and JS code generation, `reghex` allows 14you to code parsers quickly, by surrounding regular expressions with a regex-like 15[DSL](https://en.wikipedia.org/wiki/Domain-specific_language). 16 17With `reghex` you can generate a parser from a tagged template literal, which is 18quick to prototype and generates reasonably compact and performant code. 19 20_This project is still in its early stages and is experimental. Its API may still 21change and some issues may need to be ironed out._ 22 23## Quick Start 24 25##### 1. Install with yarn or npm 26 27```sh 28yarn add reghex 29# or 30npm install --save reghex 31``` 32 33##### 2. Add the plugin to your Babel configuration _(optional)_ 34 35In your `.babelrc`, `babel.config.js`, or `package.json:babel` add: 36 37```json 38{ 39 "plugins": ["reghex/babel"] 40} 41``` 42 43Alternatively, you can set up [`babel-plugin-macros`](https://github.com/kentcdodds/babel-plugin-macros) and 44import `reghex` from `"reghex/macro"` instead. 45 46This step is **optional**. `reghex` can also generate its optimised JS code during runtime. 47This will only incur a tiny parsing cost on initialisation, but due to the JIT of modern 48JS engines there won't be any difference in performance between pre-compiled and compiled 49versions otherwise. 50 51Since the `reghex` runtime is rather small, for larger grammars it may even make sense not 52to precompile the matchers at all. For this case you may pass the `{ "codegen": false }` 53option to the Babel plugin, which will minify the `reghex` matcher templates without 54precompiling them. 55 56##### 3. Have fun writing parsers! 57 58```js 59import { match, parse } from 'reghex'; 60 61const name = match('name')` 62 ${/\w+/} 63`; 64 65parse(name)('hello'); 66// [ "hello", .tag = "name" ] 67``` 68 69## Concepts 70 71The fundamental concept of `reghex` are regexes, specifically 72[sticky regexes](https://www.loganfranken.com/blog/831/es6-everyday-sticky-regex-matches/)! 73These are regular expressions that don't search a target string, but instead match at the 74specific position they're at. The flag for sticky regexes is `y` and hence 75they can be created using `/phrase/y` or `new RegExp('phrase', 'y')`. 76 77**Sticky Regexes** are the perfect foundation for a parsing framework in JavaScript! 78Because they only match at a single position they can be used to match patterns 79continuously, as a parser would. Like global regexes, we can then manipulate where 80they should be matched by setting `regex.lastIndex = index;` and after matching 81read back their updated `regex.lastIndex`. 82 83> **Note:** Sticky Regexes aren't natively 84> [supported in any versions of Internet Explorer](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/sticky#Browser_compatibility). `reghex` works around this by imitating its behaviour, which may decrease performance on IE11. 85 86This primitive allows us to build up a parser from regexes that you pass when 87authoring a parser function, also called a "matcher" in `reghex`. When `reghex` compiles 88to parser code, this code is just a sequence and combination of sticky regexes that 89are executed in order! 90 91```js 92let input = 'phrases should be parsed...'; 93let lastIndex = 0; 94 95const regex = /phrase/y; 96function matcher() { 97 let match; 98 // Before matching we set the current index on the RegExp 99 regex.lastIndex = lastIndex; 100 // Then we match and store the result 101 if ((match = regex.exec(input))) { 102 // If the RegExp matches successfully, we update our lastIndex 103 lastIndex = regex.lastIndex; 104 } 105} 106``` 107 108This mechanism is used in all matcher functions that `reghex` generates. 109Internally `reghex` keeps track of the input string and the current index on 110that string, and the matcher functions execute regexes against this state. 111 112## Authoring Guide 113 114You can write "matchers" by importing the `match` import from `reghex` and 115using it to write a matcher expression. 116 117```js 118import { match } from 'reghex'; 119 120const name = match('name')` 121 ${/\w+/} 122`; 123``` 124 125As can be seen above, the `match` function, is called with a "node name" and 126is then called as a tagged template. This template is our **parsing definition**. 127 128`reghex` functions only with its Babel plugin, which will detect `match('name')` 129and replace the entire tag with a parsing function, which may then look like 130the following in your transpiled code: 131 132```js 133import { _pattern /* ... */ } from 'reghex'; 134 135var _name_expression = _pattern(/\w+/); 136var name = function name() { 137 /* ... */ 138}; 139``` 140 141We've now successfully created a matcher, which matches a single regex, which 142is a pattern of one or more letters. We can execute this matcher by calling 143it with the curried `parse` utility: 144 145```js 146import { parse } from 'reghex'; 147 148const result = parse(name)('Tim'); 149 150console.log(result); // [ "Tim", .tag = "name" ] 151console.log(result.tag); // "name" 152``` 153 154If the string (Here: "Tim") was parsed successfully by the matcher, it will 155return an array that contains the result of the regex. The array is special 156in that it will also have a `tag` property set to the matcher's name, here 157`"name"`, which we determined when we defined the matcher as `match('name')`. 158 159```js 160import { parse } from 'reghex'; 161parse(name)('42'); // undefined 162``` 163 164Similarly, if the matcher does not parse an input string successfully, it will 165return `undefined` instead. 166 167### Nested matchers 168 169This on its own is nice, but a parser must be able to traverse a string and 170turn it into an [Abstract Syntax Tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree). 171To introduce nesting to `reghex` matchers, we can refer to one matcher in another! 172Let's extend our original example; 173 174```js 175import { match } from 'reghex'; 176 177const name = match('name')` 178 ${/\w+/} 179`; 180 181const hello = match('hello')` 182 ${/hello /} ${name} 183`; 184``` 185 186The new `hello` matcher is set to match `/hello /` and then attempts to match 187the `name` matcher afterwards. If either of these matchers fail, it will return 188`undefined` as well and roll back its changes. Using this matcher will give us 189**nested abstract output**. 190 191We can also see in this example that _outside_ of the regex interpolations, 192whitespace and newlines don't matter. 193 194```js 195import { parse } from 'reghex'; 196 197parse(hello)('hello tim'); 198/* 199 [ 200 "hello", 201 ["tim", .tag = "name"], 202 .tag = "hello" 203 ] 204*/ 205``` 206 207Furthermore, interpolations don't have to just be RegHex matchers. They can 208also be functions returning matchers or completely custom matching functions. 209This is useful when your DSL becomes _self-referential_, i.e. when one matchers 210start referencing each other forming a loop. To fix this we can create a 211function that returns our root matcher: 212 213```js 214import { match } from 'reghex'; 215 216const value = match('value')` 217 (${/\w+/} | ${() => root})+ 218`; 219 220const root = match('root')` 221 ${/root/}+ ${value} 222`; 223``` 224 225### Regex-like DSL 226 227We've seen in the previous examples that matchers are authored using tagged 228template literals, where interpolations can either be filled using regexes, 229`${/pattern/}`, or with other matchers `${name}`. 230 231The tagged template syntax supports more ways to match these interpolations, 232using a regex-like Domain Specific Language. Unlike in regexes, whitespace 233and newlines don't matter, which makes it easier to format and read matchers. 234 235We can create **sequences** of matchers by adding multiple expressions in 236a row. A matcher using `${/1/} ${/2/}` will attempt to match `1` and then `2` 237in the parsed string. This is just one feature of the regex-like DSL. The 238available operators are the following: 239 240| Operator | Example | Description | 241| -------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 242| `?` | `${/1/}?` | An **optional** may be used to make an interpolation optional. This means that the interpolation may or may not match. | 243| `*` | `${/1/}*` | A **star** can be used to match an arbitrary amount of interpolation or none at all. This means that the interpolation may repeat itself or may not be matched at all. | 244| `+` | `${/1/}+` | A **plus** is used like `*` and must match one or more times. When the matcher doesn't match, that's considered a failing case, since the match isn't optional. | 245| `\|` | `${/1/} \| ${/2/}` | An **alternation** can be used to match either one thing or another, falling back when the first interpolation fails. | 246| `()` | `(${/1/} ${/2/})+` | A **group** can be used to apply one of the other operators to an entire group of interpolations. | 247| `(?: )` | `(?: ${/1/})` | A **non-capturing group** is like a regular group, but the interpolations matched inside it don't appear in the parser's output. | 248| `(?= )` | `(?= ${/1/})` | A **positive lookahead** checks whether interpolations match, and if so continues the matcher without changing the input. If it matches, it's essentially ignored. | 249| `(?! )` | `(?! ${/1/})` | A **negative lookahead** checks whether interpolations _don't_ match, and if so continues the matcher without changing the input. If the interpolations do match the matcher is aborted. | 250 251A couple of operators also support "short hands" that allow you to write 252lookaheads or non-capturing groups a little quicker. 253 254| Shorthand | Example | Description | 255| --------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 256| `:` | `:${/1/}` | A **non-capturing group** is like a regular group, but the interpolations matched inside it don't appear in the parser's output. | 257| `=` | `=${/1/}` | A **positive lookahead** checks whether interpolations match, and if so continues the matcher without changing the input. If it matches, it's essentially ignored. | 258| `!` | `!${/1/}` | A **negative lookahead** checks whether interpolations _don't_ match, and if so continues the matcher without changing the input. If the interpolations do match the matcher is aborted. | 259 260We can combine and compose these operators to create more complex matchers. 261For instance, we can extend the original example to only allow a specific set 262of names by using the `|` operator: 263 264```js 265const name = match('name')` 266 ${/tim/} | ${/tom/} | ${/tam/} 267`; 268 269parse(name)('tim'); // [ "tim", .tag = "name" ] 270parse(name)('tom'); // [ "tom", .tag = "name" ] 271parse(name)('patrick'); // undefined 272``` 273 274The above will now only match specific name strings. When one pattern in this 275chain of **alternations** does not match, it will try the next one. 276 277We can also use **groups** to add more matchers around the alternations themselves, 278by surrounding the alternations with `(` and `)` 279 280```js 281const name = match('name')` 282 (${/tim/} | ${/tom/}) ${/!/} 283`; 284 285parse(name)('tim!'); // [ "tim", "!", .tag = "name" ] 286parse(name)('tom!'); // [ "tom", "!", .tag = "name" ] 287parse(name)('tim'); // undefined 288``` 289 290Maybe we're also not that interested in the `"!"` showing up in the output node. 291If we want to get rid of it, we can use a **non-capturing group** to hide it, 292while still requiring it. 293 294```js 295const name = match('name')` 296 (${/tim/} | ${/tom/}) (?: ${/!/}) 297`; 298 299parse(name)('tim!'); // [ "tim", .tag = "name" ] 300parse(name)('tim'); // undefined 301``` 302 303Lastly, like with regexes, `?`, `*`, and `+` may be used as "quantifiers". The first two 304may also be optional and _not_ match their patterns without the matcher failing. 305The `+` operator is used to match an interpolation _one or more_ times, while the 306`*` operators may match _zero or more_ times. Let's use this to allow the `"!"` 307to repeat. 308 309```js 310const name = match('name')` 311 (${/tim/} | ${/tom/})+ (?: ${/!/})* 312`; 313 314parse(name)('tim!'); // [ "tim", .tag = "name" ] 315parse(name)('tim!!!!'); // [ "tim", .tag = "name" ] 316parse(name)('tim'); // [ "tim", .tag = "name" ] 317parse(name)('timtim'); // [ "tim", tim", .tag = "name" ] 318``` 319 320As we can see from the above, like in regexes, quantifiers can be combined with groups, 321non-capturing groups, or other groups. 322 323### Transforming as we match 324 325In the previous sections, we've seen that the **nodes** that `reghex` outputs are arrays containing 326match strings or other nodes and have a special `tag` property with the node's type. 327We can **change this output** while we're parsing by passing a function to our matcher definition. 328 329```js 330const name = match('name', (x) => x[0])` 331 (${/tim/} | ${/tom/}) ${/!/} 332`; 333 334parse(name)('tim'); // "tim" 335``` 336 337In the above example, we're passing a small function, `x => x[0]` to the matcher as a 338second argument. This will change the matcher's output, which causes the parser to 339now return a new output for this matcher. 340 341We can use this function creatively by outputting full AST nodes, maybe even like the 342ones that resemble Babel's output: 343 344```js 345const identifier = match('identifier', (x) => ({ 346 type: 'Identifier', 347 name: x[0], 348}))` 349 ${/[\w_][\w\d_]+/} 350`; 351 352parse(name)('var_name'); // { type: "Identifier", name: "var_name" } 353``` 354 355We've now entirely changed the output of the parser for this matcher. Given that each 356matcher can change its output, we're free to change the parser's output entirely. 357By returning `null` or `undefined` in this matcher, we can also change the matcher 358to not have matched, which would cause other matchers to treat it like a mismatch! 359 360```js 361import { match, parse } from 'reghex'; 362 363const name = match('name')((x) => { 364 return x[0] !== 'tim' ? x : undefined; 365})` 366 ${/\w+/} 367`; 368 369const hello = match('hello')` 370 ${/hello /} ${name} 371`; 372 373parse(name)('tom'); // ["hello", ["tom", .tag = "name"], .tag = "hello"] 374parse(name)('tim'); // undefined 375``` 376 377Lastly, if we need to create these special array nodes ourselves, we can use `reghex`'s 378`tag` export for this purpose. 379 380```js 381import { tag } from 'reghex'; 382 383tag(['test'], 'node_name'); 384// ["test", .tag = "node_name"] 385``` 386 387### Tagged Template Parsing 388 389Any grammar in RegHex can also be used to parse a tagged template literal. 390A tagged template literal consists of a list of literals alternating with 391a list of "interpolations". 392 393In RegHex we can add an `interpolation` matcher to our grammars to allow it 394to parse interpolations in a template literal. 395 396```js 397import { interpolation } from 'reghex'; 398 399const anyNumber = interpolation((x) => typeof x === 'number'); 400 401const num = match('num')` 402 ${/[+-]?/} ${anyNumber} 403`; 404 405parse(num)`+${42}`; 406// ["+", 42, .tag = "num"] 407``` 408 409This grammar now allows us to match arbitrary values if they're input into the 410parser. We can now call our grammar using a tagged template literal themselves 411to parse this. 412 413**That's it! May the RegExp be ever in your favor.**