# constellation 🌌

A global atproto backlink index ✨

- Self hostable: handles the full write throughput of the global atproto firehose on a raspberry pi 4b + single SSD
- Storage efficient: less than 2GB/day disk consumption indexing all references in all lexicons and all non-atproto URLs
- Handles record deletion, account de/re-activation, and account deletion, ensuring accurate link counts and respecting users' data choices
- Simple JSON API

All social interactions in atproto tend to be represented by links (or references) between PDS records. This index can answer questions like "how many likes does a bsky post have", "who follows an account", "what are all the comments on a [frontpage](https://frontpage.fyi/) post", and more.

- **status**: works! api is unstable and likely to change, and no known instances have a full network backfill yet.
- source: [./constellation/](./constellation/)
- public instance: [constellation.microcosm.blue](https://constellation.microcosm.blue/)

_note: the public instance currently runs on a little raspberry pi in my house, feel free to use it! it comes only with best-effort uptime, no commitment to not breaking the api for now, and possible rate-limiting. if you want to be nice you can put your project name and bsky username (or email) in your user-agent header for api requests._


## API endpoints

currently this is a bit out of date -- refer to the [api docs hosted by the app itself](https://constellation.microcosm.blue/) for now. they also let you try out live requests.

terms as used here:

- "URI": a URI, AT-URI, or DID.
- "JSON path": a dot-separated (and dot-prefixed, for now) path to a field in an atproto record. Arrays are noted by `[]` and cannot contain a specific index.

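To make the "JSON path" notation concrete, here is a minimal Python sketch that walks a record-like value and prints the dot-prefixed path of every leaf, using `[]` for array members. The `json_paths` helper and the sample record are hypothetical and only illustrate the notation; this is not how constellation itself extracts links, and it makes no attempt to handle keys that contain dots.

```python
def json_paths(value, prefix=""):
    """Yield (path, leaf) pairs for every non-container value.

    Object keys are joined with a leading dot; array members add `[]`
    with no index, as in the notation described above.
    """
    if isinstance(value, dict):
        for key, child in value.items():
            yield from json_paths(child, f"{prefix}.{key}")
    elif isinstance(value, list):
        for child in value:
            yield from json_paths(child, f"{prefix}[]")
    else:
        yield prefix, value


# Hypothetical record, shaped loosely like a post with a subject and a mention facet.
record = {
    "subject": {"uri": "at://did:plc:example/app.bsky.feed.post/abc"},
    "facets": [{"features": [{"did": "did:plc:example"}]}],
}

paths = dict(json_paths(record))
for path in paths:
    print(path)
```

The two printed paths have the same shape as the `.subject.uri` and `.facets[].features[].did` paths that appear elsewhere in this readme.
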
### `GET /links/count`

The number of backlinks to a URI from a specified collection + json path.

#### Required URL parameters

- `target` (required): the URI. must be URL-encoded.
  - example: `at%3A%2F%2Fdid%3Aplc%3A57vlzz2egy6eqr4nksacmbht%2Fapp.bsky.feed.post%2F3lg2pgq3gq22b`
- `collection` (required): the source NSID of referring documents to consider.
  - example: `app.bsky.feed.post`
- `path` (required): the JSON path in referring documents to consider.
  - example: `.subject.uri`

#### Response

A number (u64) in plain text format

#### cURL example: Get a count of all bluesky likes for a post

```bash
curl '<HOST>/links/count?target=at%3A%2F%2Fdid%3Aplc%3A57vlzz2egy6eqr4nksacmbht%2Fapp.bsky.feed.post%2F3lg2pgq3gq22b&collection=app.bsky.feed.like&path=.subject.uri'

40
```

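The same request can be sketched from Python with only the standard library. This assumes the endpoint behaves as described above (the api is unstable, so it may not); `links_count_url`, `fetch_links_count`, and the user-agent string are hypothetical names made up for this example. `urlencode` takes care of URL-encoding the `target`.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# The public instance listed above; swap in your own host if you self-host.
HOST = "https://constellation.microcosm.blue"


def links_count_url(host, target, collection, path):
    """Build the /links/count URL, URL-encoding every query parameter."""
    query = urlencode({"target": target, "collection": collection, "path": path})
    return f"{host}/links/count?{query}"


def fetch_links_count(host, target, collection, path, user_agent):
    """Request the count and parse the plain-text number in the response body."""
    req = Request(
        links_count_url(host, target, collection, path),
        headers={"User-Agent": user_agent},
    )
    with urlopen(req, timeout=10) as resp:
        return int(resp.read().decode("utf-8").strip())


url = links_count_url(
    HOST,
    "at://did:plc:57vlzz2egy6eqr4nksacmbht/app.bsky.feed.post/3lg2pgq3gq22b",
    "app.bsky.feed.like",
    ".subject.uri",
)
print(url)
```

Calling `fetch_links_count(HOST, target, "app.bsky.feed.like", ".subject.uri", "my-project (@me.example.com)")` would then make the actual request, with a user-agent that identifies your project.
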
### `GET /links/all/count`

The number of backlinks to a URI from any source collection or json path

#### Required URL parameters

- `target` (required): the URI. must be URL-encoded.
  - example: `did:plc:vc7f4oafdgxsihk4cry2xpze`

#### Response

A JSON object `{[NSID]: {[JSON path]: [N]}}`

#### cURL example: Get reference counts to a DID from any collection at any path

```bash
curl '<HOST>/links/all/count?target=did:plc:vc7f4oafdgxsihk4cry2xpze'

{
  "app.bsky.graph.block": { ".subject": 13 },
  "app.bsky.graph.follow": { ".subject": 159 },
  "app.bsky.feed.post": { ".facets[].features[].did": 16 },
  "app.bsky.graph.listitem": { ".subject": 6 },
  "app.bsky.graph.starterpack": {
    ".feeds[].creator.did": 1,
    ".feeds[].creator.labels[].src": 1
  }
}
```


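A response of the shape `{[NSID]: {[JSON path]: [N]}}` is easy to aggregate on the client. The sketch below is a rough illustration in Python: it reuses the sample response from the example above as a string and sums the counts per collection and overall. `links_by_collection` and `total_links` are hypothetical helper names, not part of any API.

```python
import json

# Sample body copied from the cURL example above.
sample = """
{
  "app.bsky.graph.block": { ".subject": 13 },
  "app.bsky.graph.follow": { ".subject": 159 },
  "app.bsky.feed.post": { ".facets[].features[].did": 16 },
  "app.bsky.graph.listitem": { ".subject": 6 },
  "app.bsky.graph.starterpack": {
    ".feeds[].creator.did": 1,
    ".feeds[].creator.labels[].src": 1
  }
}
"""


def links_by_collection(counts):
    """Sum the per-path counts within each source collection (NSID)."""
    return {nsid: sum(paths.values()) for nsid, paths in counts.items()}


def total_links(counts):
    """Sum every count across all collections and paths."""
    return sum(n for paths in counts.values() for n in paths.values())


counts = json.loads(sample)
per_collection = links_by_collection(counts)
total = total_links(counts)
print(per_collection["app.bsky.graph.follow"], total)
```
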
some todos

- [x] find links and write them to rocksdb
- [x] handle account active status
- [x] handle account deletion
- [ ] handle account privacy setting? (is this a bsky-nsid-specific config and should that matter?)
  - instead of looking this up, should be able to listen for it to be published on the firehose.
  - this should _work_, but without backfill it won't be accurate. targeted backfill might be an option.
- [x] move ownership of canonical seq to an owned non-atomic
- [x] custom path for db storage
- [x] api server to look up backlink count
- [~] other useful endpoints for the api server
  - [x] show all nsid/path links to target
  - [x] get backlinking dids
  - [x] paging for all backlinking dids
  - [x] get count + most recent dids
  - [ ] get count with any dids from provided set
- [~] write this readme
- [?] fix it sometimes getting stuck
  - seems to unstick in my possibly-different repro (letting laptop fall asleep) after a bit.
  - [ ] add a detection for no new links coming in after some period
  - [x] add tcp connect, read, and write timeouts 🤞
- [x] handle jetstream restart: don't miss events (currently sketch: rewinds cursor by 1us so we will always double-count at least one event)
  - [x] especially: figure out what the risk is to rotating to another jetstream server in terms of gap/overlap from a different jetstream instance's cursor (follow up separately)
  - [x] jetstream: don't rotate servers, explicitly pass via cli
- [x] metrics!
  - [x] event ts lag
- [x] machine resource metrics
  - [x] disk consumption
  - [x] cpu usage
  - [x] mem usage
  - [x] network?
- [x] make all rocks apis return Result instead of unwrapping
- [~] handle all the unwraps
- [ ] deadletter queue of some kind for failed db writes
  - [ ] also for valid json that was rejected?
- [x] get it running on raspi
- [x] get an estimate of disk usage per day after a few days of running
  - very close to 1GB with data model before adding rkeys to linkers + fixing paths
- [x] make the did_init check only happen on test config (or remove it) (removed)
- [ ] actual error types (thiserror?) for lib-ish code
- [~] clean up the main readme
- [x] web server metrics
  - [x] origin and ua labels
- [ ] tokio metrics?
- [x] handle shutdown cleanly -- be nice to rocksdb
- [x] add user-agent to jetstream request
- [ ] wow the shutdown stuff i wrote is really bad and doesn't work a lot
- [x] serve html for browser requests
- [ ] add a health check endpoint
- [x] add seq numbers to metrics
- [ ] persist the jetstream server url, error if started with a different one (maybe with --switch-streams or something)
- [ ] put delete-account tasks into a separate (persisted?) task queue for the writer so it can work on them incrementally.
- [ ] jetstream: connect retry: only reset counter after some *time* has passed.
- [x] either count or estimate the total number of links added (distinct from link targets)
- [x] jetstream: don't crash on connection refused (retry * backoff)
- [x] allow cors requests (ie. atproto-browser. (but it's really meant for backends))
- [x] api: get distinct linking dids (https://bsky.app/profile/bnewbold.net/post/3lhhzejv7zc2h)
  - [x] endpoint for count
  - [x] endpoint for listing them
  - [x] add to exploratory /all endpoint
- [ ] nginx: support http2
- [x] nginx metrics
- [ ] add TimeoutLayer for axum
- [~] rocksdb metrics
  - [x] write ops (count? per actionable?)
  - [x] write time hist
  - [ ] read ops (api)
  - [ ] expose internal stats?
- [ ] figure out what's the right thing to do if merge op fails. happened on startup after an unclean reboot.
- [x] backups!
  - [x] manual backup on startup
  - [x] background task to create backups on an interval
- [ ] add a low-ulimit check on startup?

cache
- [ ] set api response headers
  - [ ] put "stale-while-revalidate" in Cache-Control w/ num seconds
  - [ ] put "stale-if-error" in Cache-Control w/ num seconds
  - [ ] set Expires or Cache-Control expires
  - [ ] add Accept to vary response
- [ ] cache vary: might need to take bsky account privacy setting into account (unless this ends up being in query)

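For reference, the header values these cache todos point at could look roughly like the output of the Python sketch below. Nothing here is implemented in constellation; `cache_headers` is a hypothetical helper and the second counts are placeholder assumptions. `max-age` stands in for the "Cache-Control expires" item, and `stale-while-revalidate` / `stale-if-error` are the standard Cache-Control extensions that take a number of seconds.

```python
def cache_headers(max_age, stale_while_revalidate, stale_if_error):
    """Build response headers covering the cache todos above.

    All arguments are durations in seconds.
    """
    directives = [
        f"max-age={max_age}",
        f"stale-while-revalidate={stale_while_revalidate}",
        f"stale-if-error={stale_if_error}",
    ]
    return {
        "Cache-Control": ", ".join(directives),
        # Responses differ by requested content type (html vs json), so vary on Accept.
        "Vary": "Accept",
    }


# Placeholder durations, chosen only for illustration.
headers = cache_headers(max_age=60, stale_while_revalidate=30, stale_if_error=300)
for name, value in headers.items():
    print(f"{name}: {value}")
```
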
data fixes
- [x] add rkey to linkers 🤦‍♀️
- [x] don't remove deleted links from the reverse records -- null them out. this will keep things stable for paging.
- [x] don't show deactivated accounts in link responses
- [ ] canonicalize handles to dids!
- [ ] links:
  - [~] pull `$type`/`type` from object children of arrays (distinguish replies, quotes, etc)
    - just $type to start
  - [ ] rewrite the entire "path" stuff
    - [ ] actually define the format (deal with in-band dots etc)
  - [x] ~_could_ throw cid neighbour into the target. probably should? but it's a lot of high volume uncompressible bytes~
    - and it could be looked up from the linker's doc
    - ^^ for now, look up from source doc to get cid. might revisit this later.