Mirror: The magical sticky regex-based parser generator
<div align="center">
  <img alt="reghex" width="250" src="docs/reghex-logo.png" />
  <br />
  <br />
  <strong>
    The magical sticky regex-based parser generator
  </strong>
  <br />
  <br />
  <br />
</div>

Leveraging the power of sticky regexes and Babel code generation, `reghex` allows
you to code parsers quickly, by surrounding regular expressions with a regex-like
[DSL](https://en.wikipedia.org/wiki/Domain-specific_language).

With `reghex` you can generate a parser from a tagged template literal, which is
quick to prototype and generates reasonably compact and performant code.

_This project is still in its early stages and is experimental. Its API may still
change and some issues may need to be ironed out._

## Quick Start

##### 1. Install with yarn or npm

```sh
yarn add reghex
# or
npm install --save reghex
```

##### 2. Add the plugin to your Babel configuration (`.babelrc`, `babel.config.js`, or `package.json:babel`)

```json
{
  "plugins": ["reghex/babel"]
}
```

Alternatively, you can set up [`babel-plugin-macros`](https://github.com/kentcdodds/babel-plugin-macros) and
import `reghex` from `"reghex/macro"` instead.

##### 3. Have fun writing parsers!

```js
import match, { parse } from 'reghex';

const name = match('name')`
  ${/\w+/}
`;

parse(name)('hello');
// [ "hello", .tag = "name" ]
```

## Concepts

The fundamental concept of `reghex` is the regex, specifically the
[sticky regex](https://www.loganfranken.com/blog/831/es6-everyday-sticky-regex-matches/)!
Sticky regexes are regular expressions that don't search a target string, but instead match
only at the specific position they're at. The flag for sticky regexes is `y`, so
they can be created using `/phrase/y` or `new RegExp('phrase', 'y')`.

**Sticky Regexes** are the perfect foundation for a parsing framework in JavaScript!
Because they only match at a single position, they can be used to match patterns
continuously, as a parser would. Like global regexes, we can control where
they should be matched by setting `regex.lastIndex = index;` and, after matching,
read back their updated `regex.lastIndex`.

> **Note:** Sticky Regexes aren't natively
> [supported in any version of Internet Explorer](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/sticky#Browser_compatibility). `reghex` works around this by imitating their behaviour, which may decrease performance on IE11.

This primitive allows us to build up a parser from the regexes that you pass when
authoring a parser function, also called a "matcher" in `reghex`. When `reghex` compiles
to parser code, this code is just a sequence and combination of sticky regexes that
are executed in order!

```js
let input = 'phrases should be parsed...';
let lastIndex = 0;

const regex = /phrase/y;
function matcher() {
  let match;
  // Before matching we set the current index on the RegExp
  regex.lastIndex = lastIndex;
  // Then we match and store the result
  if ((match = regex.exec(input))) {
    // If the RegExp matches successfully, we update our lastIndex
    lastIndex = regex.lastIndex;
  }
}
```

This mechanism is used in all matcher functions that `reghex` generates.
Internally `reghex` keeps track of the input string and the current index on
that string, and the matcher functions execute regexes against this state.

## Authoring Guide

You can write "matchers" by importing the default import from `reghex` and
using it to write a matcher expression.

```js
import match from 'reghex';

const name = match('name')`
  ${/\w+/}
`;
```

As can be seen above, the `match` function, which is what we've called the
default import, is called with a "node name" and is then called as a tagged
template.
This template is our **parsing definition**.

`reghex` functions only with its Babel plugin, which will detect `match('name')`
and replace the entire tag with a parsing function, which may then look like
the following in your transpiled code:

```js
import { _pattern /* ... */ } from 'reghex';

var _name_expression = _pattern(/\w+/);
var name = function name() {
  /* ... */
};
```

We've now successfully created a matcher that matches a single regex, which
is a pattern of one or more word characters (letters, digits, or underscores).
We can execute this matcher by calling it with the curried `parse` utility:

```js
import { parse } from 'reghex';

const result = parse(name)('Tim');

console.log(result); // [ "Tim", .tag = "name" ]
console.log(result.tag); // "name"
```

If the string (here: `"Tim"`) was parsed successfully by the matcher, it will
return an array that contains the result of the regex. The array is special
in that it will also have a `tag` property set to the matcher's name, here
`"name"`, which we determined when we defined the matcher as `match('name')`.

```js
import { parse } from 'reghex';
// "!?" contains no word characters, so /\w+/ cannot match at index 0
parse(name)('!?'); // undefined
```

Similarly, if the matcher does not parse an input string successfully, it will
return `undefined` instead.

### Nested matchers

This on its own is nice, but a parser must be able to traverse a string and
turn it into an [Abstract Syntax Tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree).
To introduce nesting to `reghex` matchers, we can refer to one matcher in another!
Let's extend our original example:

```js
import match from 'reghex';

const name = match('name')`
  ${/\w+/}
`;

const hello = match('hello')`
  ${/hello /} ${name}
`;
```

The new `hello` matcher is set to match `/hello /` and then attempts to match
the `name` matcher afterwards.
If either of these matchers fails, it will return
`undefined` as well and roll back its changes. Using this matcher will give us
**nested abstract output**.

We can also see in this example that _outside_ of the regex interpolations,
whitespace and newlines don't matter.

```js
import { parse } from 'reghex';

parse(hello)('hello tim');
/*
  [
    "hello",
    ["tim", .tag = "name"],
    .tag = "hello"
  ]
*/
```

### Regex-like DSL

We've seen in the previous examples that matchers are authored using tagged
template literals, where interpolations can be filled either with regexes,
`${/pattern/}`, or with other matchers, `${name}`.

The tagged template syntax supports more ways to match these interpolations,
using a regex-like Domain Specific Language. Unlike in regexes, whitespace
and newlines don't matter, which makes matchers easier to format and read.

We can create **sequences** of matchers by adding multiple expressions in
a row. A matcher using `${/1/} ${/2/}` will attempt to match `1` and then `2`
in the parsed string. This is just one feature of the regex-like DSL. The
available operators are the following:

| Operator | Example            | Description |
| -------- | ------------------ | ----------- |
| `?`      | `${/1/}?`          | An **optional** may be used to make an interpolation optional. This means that the interpolation may or may not match. |
| `*`      | `${/1/}*`          | A **star** can be used to match an interpolation any number of times, or not at all. This means that the interpolation may repeat itself or may not be matched at all. |
| `+`      | `${/1/}+`          | A **plus** is used like `*` but must match one or more times. When the interpolation doesn't match at all, that's considered a failing case, since the match isn't optional. |
| `\|`     | `${/1/} \| ${/2/}` | An **alternation** can be used to match either one thing or another, falling back when the first interpolation fails. |
| `()`     | `(${/1/} ${/2/})+` | A **group** can be used to apply one of the other operators to an entire group of interpolations. |
| `(?: )`  | `(?: ${/1/})`      | A **non-capturing group** is like a regular group, but whatever the interpolations inside it match won't appear in the parser's output. |
| `(?= )`  | `(?= ${/1/})`      | A **positive lookahead** will check whether interpolations match, and if so will continue the matcher without changing the input. If it matches it's essentially ignored. |
| `(?! )`  | `(?! ${/1/})`      | A **negative lookahead** will check whether interpolations _don't_ match, and if so will continue the matcher without changing the input. If the interpolations do match, the matcher will be aborted. |

We can combine and compose these operators to create more complex matchers.
For instance, we can extend the original example to only allow a specific set
of names by using the `|` operator:

```js
const name = match('name')`
  ${/tim/} | ${/tom/} | ${/tam/}
`;

parse(name)('tim'); // [ "tim", .tag = "name" ]
parse(name)('tom'); // [ "tom", .tag = "name" ]
parse(name)('patrick'); // undefined
```

The above will now only match specific name strings. When one pattern in this
chain of **alternations** does not match, it will try the next one.
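
To make the fallback behaviour of alternations more concrete, here is a minimal sketch in plain JavaScript that needs neither `reghex` nor Babel. It is not the code that `reghex` generates; `matchAlternation` is a hypothetical helper written only for illustration, and it assumes every pattern carries the `y` flag.

```js
// Hypothetical helper (not part of reghex): tries each sticky regex
// at the same index and returns the first one that matches.
function matchAlternation(patterns, input, index) {
  for (const pattern of patterns) {
    // Every alternative starts from the same position in the input
    pattern.lastIndex = index;
    const match = pattern.exec(input);
    if (match) {
      // Report the matched text and where the next matcher should continue
      return { value: match[0], nextIndex: pattern.lastIndex };
    }
  }
  // No alternative matched at this index
  return undefined;
}

const names = [/tim/y, /tom/y, /tam/y];

const first = matchAlternation(names, 'tom', 0);
// { value: "tom", nextIndex: 3 }

const none = matchAlternation(names, 'patrick', 0);
// undefined
```

Because each alternative is retried from the same `index`, a failed alternative leaves no trace before the next one runs, which is the same idea as the `lastIndex` bookkeeping described in the Concepts section.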

We can also use **groups** to add more matchers around the alternations themselves,
by surrounding the alternations with `(` and `)`:

```js
const name = match('name')`
  (${/tim/} | ${/tom/}) ${/!/}
`;

parse(name)('tim!'); // [ "tim", "!", .tag = "name" ]
parse(name)('tom!'); // [ "tom", "!", .tag = "name" ]
parse(name)('tim'); // undefined
```

Maybe we're also not that interested in the `"!"` showing up in the output node.
If we want to get rid of it, we can use a **non-capturing group** to hide it,
while still requiring it.

```js
const name = match('name')`
  (${/tim/} | ${/tom/}) (?: ${/!/})
`;

parse(name)('tim!'); // [ "tim", .tag = "name" ]
parse(name)('tim'); // undefined
```

Lastly, like with regexes, `?`, `*`, and `+` may be used as "quantifiers". The first two
are optional and may _not_ match their patterns without the matcher failing.
The `+` operator is used to match an interpolation _one or more_ times, while the
`*` operator may match _zero or more_ times. Let's use this to allow the `"!"`
to repeat.

```js
const name = match('name')`
  (${/tim/} | ${/tom/})+ (?: ${/!/})*
`;

parse(name)('tim!'); // [ "tim", .tag = "name" ]
parse(name)('tim!!!!'); // [ "tim", .tag = "name" ]
parse(name)('tim'); // [ "tim", .tag = "name" ]
parse(name)('timtim'); // [ "tim", "tim", .tag = "name" ]
```

As we can see from the above, like in regexes, quantifiers can be combined with groups,
non-capturing groups, or other groups.

### Transforming as we match

In the previous sections, we've seen that the **nodes** that `reghex` outputs are arrays containing
match strings or other nodes and have a special `tag` property with the node's type.
We can **change this output** while we're parsing by passing a function as a second argument to our matcher definition.

```js
const name = match('name', (x) => x[0])`
  (${/tim/} | ${/tom/}) ${/!/}
`;

// The pattern still requires the "!", so the input must include it
parse(name)('tim!'); // "tim"
```

In the above example, we're passing a small function, `x => x[0]`, to the matcher as a
second argument. This function receives the node and its return value replaces the node,
so the parser now returns a new output for this matcher.

We can use this function creatively by outputting full AST nodes, maybe even
ones that resemble Babel's output:

```js
const identifier = match('identifier', (x) => ({
  type: 'Identifier',
  name: x[0],
}))`
  ${/[\w_][\w\d_]+/}
`;

parse(identifier)('var_name'); // { type: "Identifier", name: "var_name" }
```

We've now entirely changed the output of the parser for this matcher. Given that each
matcher can change its output, we're free to change the parser's output entirely.
By **returning a falsy value** in this function, we can also change the matcher to not have
matched, which causes other matchers to treat it like a mismatch!

```js
import match, { parse } from 'reghex';

const name = match('name', (x) => {
  // Reject the name "tim" by returning undefined
  return x[0] !== 'tim' ? x : undefined;
})`
  ${/\w+/}
`;

const hello = match('hello')`
  ${/hello /} ${name}
`;

parse(hello)('hello tom'); // ["hello", ["tom", .tag = "name"], .tag = "hello"]
parse(hello)('hello tim'); // undefined
```

Lastly, if we need to create these special array nodes ourselves, we can use `reghex`'s
`tag` export for this purpose.

```js
import { tag } from 'reghex';

tag(['test'], 'node_name');
// ["test", .tag = "node_name"]
```

**That's it! May the RegExp be ever in your favor.**