Constellation, Spacedust, Slingshot, UFOs: atproto crates and services for microcosm
# constellation 🌌

A global atproto backlink index ✨

- Self hostable: handles the full write throughput of the global atproto firehose on a raspberry pi 4b + single SSD
- Storage efficient: less than 2GB/day disk consumption indexing all references in all lexicons and all non-atproto URLs
- Handles record deletion, account de/re-activation, and account deletion, ensuring accurate link counts and respecting users' data choices
- Simple JSON API

All social interactions in atproto tend to be represented by links (or references) between PDS records. This index can answer questions like "how many likes does a bsky post have", "who follows an account", "what are all the comments on a [frontpage](https://frontpage.fyi/) post", and more.

- **status**: works! api is unstable and likely to change, and no known instances have a full network backfill yet.
- source: [./constellation/](./constellation/)
- public instance: [constellation.microcosm.blue](https://constellation.microcosm.blue/)

_note: the public instance currently runs on a little raspberry pi in my house, feel free to use it! it comes with only best-effort uptime, no commitment to not breaking the api for now, and possible rate-limiting. if you want to be nice you can put your project name and bsky username (or email) in your user-agent header for api requests._


## API endpoints

this section is currently a bit out of date -- refer to the [api docs hosted by the app itself](https://constellation.microcosm.blue/) for now. they also let you try out live requests.

terms as used here:

- "URI": a URI, AT-URI, or DID.
- "JSON path": a dot-separated (and dot-prefixed, for now) path to a field in an atproto record. Arrays are noted by `[]` and cannot contain a specific index.

### `GET /links/count`

The number of backlinks to a URI from a specified collection + JSON path.

#### Required URL parameters

- `target` (required): the URI. must be URL-encoded.
  - example: `at%3A%2F%2Fdid%3Aplc%3A57vlzz2egy6eqr4nksacmbht%2Fapp.bsky.feed.post%2F3lg2pgq3gq22b`
- `collection` (required): the source NSID of referring documents to consider.
  - example: `app.bsky.feed.post`
- `path` (required): the JSON path in referring documents to consider.
  - example: `.subject.uri`

#### Response

A number (u64) in plain text format

#### cURL example: Get a count of all bluesky likes for a post

```bash
curl '<HOST>/links/count?target=at%3A%2F%2Fdid%3Aplc%3A57vlzz2egy6eqr4nksacmbht%2Fapp.bsky.feed.post%2F3lg2pgq3gq22b&collection=app.bsky.feed.like&path=.subject.uri'

40
```

### `GET /links/all/count`

The number of backlinks to a URI from any source collection or JSON path

#### Required URL parameters

- `target` (required): the URI. must be URL-encoded.
  - example: `did:plc:vc7f4oafdgxsihk4cry2xpze`

#### Response

A JSON object `{[NSID]: {[JSON path]: [N]}}`

#### cURL example: Get reference counts to a DID from any collection at any path

```bash
curl '<HOST>/links/all/count?target=did:plc:vc7f4oafdgxsihk4cry2xpze'

{
  "app.bsky.graph.block": { ".subject": 13 },
  "app.bsky.graph.follow": { ".subject": 159 },
  "app.bsky.feed.post": { ".facets[].features[].did": 16 },
  "app.bsky.graph.listitem": { ".subject": 6 },
  "app.bsky.graph.starterpack": {
    ".feeds[].creator.did": 1,
    ".feeds[].creator.labels[].src": 1
  }
}
```


some todos

- [x] find links and write them to rocksdb
- [x] handle account active status
- [x] handle account deletion
- [ ] handle account privacy setting? (is this a bsky-nsid-specific config and should that matter?)
  - instead of looking this up, should be able to listen for it to be published on the firehose.
  - this should _work_, but without backfill it won't be accurate. targeted backfill might be an option.
- [x] move ownership of canonical seq to an owned non-atomic
- [x] custom path for db storage
- [x] api server to look up backlink count
- [~] other useful endpoints for the api server
  - [x] show all nsid/path links to target
  - [x] get backlinking dids
  - [x] paging for all backlinking dids
  - [x] get count + most recent dids
  - [ ] get count with any dids from provided set
- [~] write this readme
- [?] fix it sometimes getting stuck
  - seems to unstick in my possibly-different repro (letting laptop fall asleep) after a bit.
  - [ ] add a detection for no new links coming in after some period
  - [x] add tcp connect, read, and write timeouts 🤞
- [x] handle jetstream restart: don't miss events (currently sketch: rewinds cursor by 1us so we will always double-count at least one event)
  - [x] especially: figure out what the risk is to rotating to another jetstream server in terms of gap/overlap from a different jetstream instance's cursor (follow up separately)
  - [x] jetstream: don't rotate servers, explicitly pass via cli
- [x] metrics!
  - [x] event ts lag
- [x] machine resource metrics
  - [x] disk consumption
  - [x] cpu usage
  - [x] mem usage
  - [x] network?
- [x] make all rocks apis return Result instead of unwrapping
- [~] handle all the unwraps
- [ ] deadletter queue of some kind for failed db writes
  - [ ] also for valid json that was rejected?
- [x] get it running on raspi
- [x] get an estimate of disk usage per day after a few days of running
  - very close to 1GB with data model before adding rkeys to linkers + fixing paths
- [x] make the did_init check only happen on test config (or remove it) (removed)
- [ ] actual error types (thiserror?) for lib-ish code
- [~] clean up the main readme
- [x] web server metrics
  - [x] origin and ua labels
- [ ] tokio metrics?
- [x] handle shutdown cleanly -- be nice to rocksdb
- [x] add user-agent to jetstream request
- [ ] wow the shutdown stuff i wrote is really bad and doesn't work a lot
- [x] serve html for browser requests
- [ ] add a health check endpoint
- [x] add seq numbers to metrics
- [ ] persist the jetstream server url, error if started with a different one (maybe with --switch-streams or something)
- [ ] put delete-account tasks into a separate (persisted?) task queue for the writer so it can work on them incrementally.
- [ ] jetstream: connect retry: only reset counter after some *time* has passed.
- [x] either count or estimate the total number of links added (distinct from link targets)
- [x] jetstream: don't crash on connection refused (retry * backoff)
- [x] allow cors requests (ie. atproto-browser. (but it's really meant for backends))
- [x] api: get distinct linking dids (https://bsky.app/profile/bnewbold.net/post/3lhhzejv7zc2h)
  - [x] endpoint for count
  - [x] endpoint for listing them
  - [x] add to exploratory /all endpoint
- [ ] nginx: support http2
- [x] nginx metrics
- [ ] add TimeoutLayer for axum
- [~] rocksdb metrics
  - [x] write ops (count? per actionable?)
  - [x] write time hist
  - [ ] read ops (api)
  - [ ] expose internal stats?
- [ ] figure out what's the right thing to do if merge op fails. happened on startup after an unclean reboot.
- [x] backups!
  - [x] manual backup on startup
  - [x] background task to create backups on an interval
- [ ] add a low-ulimit check on startup?

cache
- [ ] set api response headers
  - [ ] put "stale-while-revalidate" in Cache-Control w/ num seconds
  - [ ] put "stale-if-error" in Cache-Control w/ num seconds
  - [ ] set Expires or Cache-Control expires
  - [ ] add Accept to vary response
- [ ] cache vary: might need to take bsky account privacy setting into account (unless this ends up being in query)

data fixes
- [x] add rkey to linkers 🤦‍♀️
- [x] don't remove deleted links from the reverse records -- null them out. this will keep things stable for paging.
- [x] don't show deactivated accounts in link responses
- [ ] canonicalize handles to dids!
- [ ] links:
  - [~] pull `$type`/`type` from object children of arrays (distinguish replies, quotes, etc)
    - just $type to start
  - [ ] rewrite the entire "path" stuff
    - [ ] actually define the format (deal with in-band dots etc)
  - [x] ~_could_ throw cid neighbour into the target. probably should? but it's a lot of high volume uncompressible bytes~
    - and it could be looked up from the linker's doc
    - ^^ for now, look up from source doc to get cid. might revisit this later.
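
The `GET /links/count` and `GET /links/all/count` endpoints described under API endpoints need a URL-encoded `target`, and `/links/all/count` returns a nested `{[NSID]: {[JSON path]: [N]}}` object. A minimal Python sketch of both steps follows, using only the standard library. The host `https://constellation.example` and the helper names `links_count_url` and `total_backlinks` are made up for illustration and are not part of constellation, and since the api is unstable the query format may have changed.

```python
from urllib.parse import quote, urlencode

# Hypothetical host, for illustration only: substitute the instance you use.
HOST = "https://constellation.example"


def links_count_url(host: str, target: str, collection: str, path: str) -> str:
    """Build a GET /links/count URL with every query parameter percent-encoded."""
    query = urlencode(
        {"target": target, "collection": collection, "path": path},
        quote_via=quote,  # quote() percent-encodes spaces instead of using '+'
        safe="",          # also encode ':' and '/' so AT-URIs are fully escaped
    )
    return f"{host.rstrip('/')}/links/count?{query}"


def total_backlinks(all_counts: dict) -> dict:
    """Sum the per-path counts of a /links/all/count response for each NSID."""
    return {nsid: sum(paths.values()) for nsid, paths in all_counts.items()}


if __name__ == "__main__":
    # Same post, collection, and path as the cURL example for /links/count.
    print(links_count_url(
        HOST,
        "at://did:plc:57vlzz2egy6eqr4nksacmbht/app.bsky.feed.post/3lg2pgq3gq22b",
        "app.bsky.feed.like",
        ".subject.uri",
    ))

    # A fragment of the sample /links/all/count response shown earlier.
    sample = {
        "app.bsky.graph.follow": {".subject": 159},
        "app.bsky.graph.starterpack": {
            ".feeds[].creator.did": 1,
            ".feeds[].creator.labels[].src": 1,
        },
    }
    print(total_backlinks(sample))
```

Building the query with `urlencode` rather than string concatenation keeps the `at://` prefix and the slashes inside an AT-URI from being read as URL structure, which is why the parameter docs ask for an encoded `target`.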