constellation 🌌#
A global atproto backlink index ✨
- Self hostable: handles the full write throughput of the global atproto firehose on a raspberry pi 4b + single SSD
- Storage efficient: less than 2GB/day disk consumption indexing all references in all lexicons and all non-atproto URLs
- Handles record deletion, account de/re-activation, and account deletion, ensuring accurate link counts and respecting users' data choices
- Simple JSON API
All social interactions in atproto tend to be represented by links (or references) between PDS records. This index can answer questions like "how many likes does a bsky post have", "who follows an account", "what are all the comments on a frontpage post", and more.
- status: works! api is unstable and likely to change, and no known instances have a full network backfill yet.
- source: ./constellation/
- public instance: constellation.microcosm.blue
note: the public instance currently runs on a little raspberry pi in my house, feel free to use it! it comes with only best-effort uptime, no commitment to not breaking the api for now, and possible rate-limiting. if you want to be nice you can put your project name and bsky username (or email) in your user-agent header for api requests.
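A minimal sketch of identifying your project in the user-agent header, using only Python's standard library. The project name and contact below are placeholders, the `https://` scheme is an assumption, and the request is only constructed here, not sent.

```python
from urllib.request import Request

# Hypothetical values: replace with your own project name and contact.
PROJECT = "my-backlink-demo"
CONTACT = "@example.bsky.social"


def build_request(url: str) -> Request:
    """Build a GET request whose User-Agent names the project and a contact."""
    user_agent = f"{PROJECT} ({CONTACT})"
    return Request(url, headers={"User-Agent": user_agent})


# Assumption: the public instance is reachable over https.
req = build_request("https://constellation.microcosm.blue/")
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would send it with that header attached.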
API endpoints#
currently this is a bit out of date -- refer to the api docs hosted by the app itself for now. they also let you try out live requests.
terms as used here:
- "URI": a URI, AT-URI, or DID.
- "JSON path": a dot-separated (and dot-prefixed, for now) path to a field in an atproto record. Arrays are noted by [] and cannot contain a specific index.
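To make the "JSON path" term concrete, here is a rough sketch that walks a record and prints the dot-prefixed path of every string field, writing arrays as `[]` with no index. It is only an illustration of the path notation, not how constellation itself extracts links (it does not decide which strings are actually links), and the function name and trimmed-down records are made up for the example.

```python
from typing import Any, Iterator, Tuple


def walk_paths(value: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (json_path, string_value) pairs for every string leaf in a record."""
    if isinstance(value, dict):
        for key, child in value.items():
            # Each object key extends the path with a leading dot.
            yield from walk_paths(child, f"{path}.{key}")
    elif isinstance(value, list):
        for child in value:
            # Arrays are noted by [] and never carry a specific index.
            yield from walk_paths(child, f"{path}[]")
    elif isinstance(value, str):
        yield path, value


# Trimmed-down, illustrative records (not complete lexicon records).
like = {
    "$type": "app.bsky.feed.like",
    "subject": {
        "uri": "at://did:plc:57vlzz2egy6eqr4nksacmbht/app.bsky.feed.post/3lg2pgq3gq22b",
    },
}
post = {"facets": [{"features": [{"did": "did:plc:vc7f4oafdgxsihk4cry2xpze"}]}]}

like_paths = dict(walk_paths(like))
post_paths = dict(walk_paths(post))
for json_path in list(like_paths) + list(post_paths):
    print(json_path)
```

The like record yields `.subject.uri` and the post record yields `.facets[].features[].did`, the same shapes used in the endpoint examples.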
GET /links/count#
The number of backlinks to a URI from a specified collection + json path.
Required URL parameters#
- target (required): the URI. must be URL-encoded. example: at%3A%2F%2Fdid%3Aplc%3A57vlzz2egy6eqr4nksacmbht%2Fapp.bsky.feed.post%2F3lg2pgq3gq22b
- collection (required): the source NSID of referring documents to consider. example: app.bsky.feed.post
- path (required): the JSON path in referring documents to consider. example: .subject.uri
Response#
A number (u64) in plain text format
cURL example: Get a count of all bluesky likes for a post#
curl '<HOST>/links/count?target=at%3A%2F%2Fdid%3Aplc%3A57vlzz2egy6eqr4nksacmbht%2Fapp.bsky.feed.post%2F3lg2pgq3gq22b&collection=app.bsky.feed.like&path=.subject.uri'
40
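The same request can be put together from Python. This sketch only builds the URL (with the target URL-encoded as required) and parses a plain-text count; it does not make a network call. The helper names are hypothetical, and the `https://` host is an assumption standing in for `<HOST>`.

```python
from urllib.parse import quote, urlencode

# Assumption: substitute whichever instance you are actually querying.
HOST = "https://constellation.microcosm.blue"


def links_count_url(host: str, target: str, collection: str, path: str) -> str:
    """Build a /links/count URL with every parameter percent-encoded."""
    query = urlencode(
        {"target": target, "collection": collection, "path": path},
        quote_via=quote,  # percent-encode instead of using '+' for spaces
        safe="",          # also encode ':' and '/' inside the AT-URI
    )
    return f"{host}/links/count?{query}"


def parse_count(body: str) -> int:
    """The response body is a plain-text number."""
    return int(body.strip())


url = links_count_url(
    HOST,
    target="at://did:plc:57vlzz2egy6eqr4nksacmbht/app.bsky.feed.post/3lg2pgq3gq22b",
    collection="app.bsky.feed.like",
    path=".subject.uri",
)
print(url)
```

Fetching `url` (for example with `urllib.request.urlopen`) and passing the decoded body to `parse_count` gives the like count as an integer.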
GET /links/all/count#
The number of backlinks to a URI from any source collection or json path
Required URL parameters#
- target (required): the URI. must be URL-encoded. example: did:plc:vc7f4oafdgxsihk4cry2xpze
Response#
A JSON object {[NSID]: {[JSON path]: [N]}}
cURL example: Get reference counts to a DID from any collection at any path#
curl '<HOST>/links/all/count?target=did:plc:vc7f4oafdgxsihk4cry2xpze'
{
"app.bsky.graph.block": { ".subject": 13 },
"app.bsky.graph.follow": { ".subject": 159 },
"app.bsky.feed.post": { ".facets[].features[].did": 16 },
"app.bsky.graph.listitem": { ".subject": 6 },
"app.bsky.graph.starterpack":
{
".feeds[].creator.did": 1,
".feeds[].creator.labels[].src": 1
}
}
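Since the response nests counts as NSID, then JSON path, then number, a client will usually want to flatten or sum it. A small sketch using the sample response above as input (the function name is hypothetical, and nothing is fetched over the network):

```python
import json
from typing import Dict

# The sample response shown above, pasted in as a string.
SAMPLE = """
{
  "app.bsky.graph.block": { ".subject": 13 },
  "app.bsky.graph.follow": { ".subject": 159 },
  "app.bsky.feed.post": { ".facets[].features[].did": 16 },
  "app.bsky.graph.listitem": { ".subject": 6 },
  "app.bsky.graph.starterpack": {
    ".feeds[].creator.did": 1,
    ".feeds[].creator.labels[].src": 1
  }
}
"""


def totals_by_collection(body: str) -> Dict[str, int]:
    """Sum the per-path counts under each source collection (NSID)."""
    counts = json.loads(body)
    return {nsid: sum(paths.values()) for nsid, paths in counts.items()}


totals = totals_by_collection(SAMPLE)
for nsid, total in totals.items():
    print(nsid, total)
```

Here `totals["app.bsky.graph.follow"]` is the follower count for the DID, and the two starterpack paths are added together.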
Contributions#
Licensing#
Constellation's source code is currently available exclusively under the AGPL license (see LICENSE).
In the future, its code MAY become available under the MIT and/or Apache2.0 licenses, at the sole discretion of the microcosm organization. Contributing implies acceptance of this possible future licensing change. The change has not happened yet and is not guaranteed.
some todos
- find links and write them to rocksdb
- handle account active status
- handle account deletion
- handle account privacy setting? (is this a bsky-nsid-specific config and should that matter?)
- instead of looking this up, should be able to listen for it to be published on the firehose.
- this should work, but without backfill it won't be accurate. targeted backfill might be an option.
- move ownership of canonical seq to an owned non-atomic
- custom path for db storage
- api server to look up backlink count
- [~] other useful endpoints for the api server
- show all nsid/path links to target
- get backlinking dids
- paging for all backlinking dids
- get count + most recent dids
- get count with any dids from provided set
- [~] write this readme
- [?] fix it sometimes getting stuck
- seems to unstick in my possibly-different repro (letting laptop fall asleep) after a bit.
- add a detection for no new links coming in after some period
- add tcp connect, read, and write timeouts 🤞
- handle jetstream restart: don't miss events (currently sketch: rewinds cursor by 1us so we will always double-count at least one event)
- especially: figure out what the risk is to rotating to another jetstream server in terms of gap/overlap from a different jetstream instance's cursor (follow up separately)
- jetstream: don't rotate servers, explicitly pass via cli
- metrics!
- event ts lag
- machine resource metrics
- disk consumption
- cpu usage
- mem usage
- network?
- make all rocks apis return Result instead of unwrapping
- [~] handle all the unwraps
- deadletter queue of some kind for failed db writes
- also for valid json that was rejected?
- get it running on raspi
- get an estimate of disk usage per day after a few days of running
- very close to 1GB with data model before adding rkeys to linkers + fixing paths
- make the did_init check only happen on test config (or remove it) (removed)
- actual error types (thiserror?) for lib-ish code
- [~] clean up the main readme
- web server metrics
- origin and ua labels
- tokio metrics?
- handle shutdown cleanly -- be nice to rocksdb
- add user-agent to jetstream request
- wow the shutdown stuff i wrote is really bad and doesn't work a lot
- serve html for browser requests
- add a health check endpoint
- add seq numbers to metrics
- persist the jetstream server url, error if started with a different one (maybe with --switch-streams or something)
- put delete-account tasks into a separate (persisted?) task queue for the writer so it can work on them incrementally.
- jetstream: connect retry: only reset counter after some time has passed.
- either count or estimate the total number of links added (distinct from link targets)
- jetstream: don't crash on connection refused (retry * backoff)
- allow cors requests (ie. atproto-browser. (but it's really meant for backends))
- api: get distinct linking dids (https://bsky.app/profile/bnewbold.net/post/3lhhzejv7zc2h)
- endpoint for count
- endpoint for listing them
- add to exploratory /all endpoint
- nginx: support http2
- nginx metrics
- add TimeoutLayer for axum
- [~] rocksdb metrics
- write ops (count? per actionable?)
- write time hist
- read ops (api)
- expose internal stats?
- figure out what's the right thing to do if merge op fails. happened on startup after an unclean reboot.
- backups!
- manual backup on startup
- background task to create backups on an interval
- add a low-ulimit check on startup?
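The two jetstream retry items in the list above ("retry * backoff", and only resetting the retry counter after some time has passed) could be sketched roughly like this. It is an illustration in Python, not the project's actual code; the class name, delays, and the "stable" threshold are all hypothetical.

```python
class RetryBackoff:
    """Exponential backoff whose counter only resets after a stable connection."""

    def __init__(self, base_delay: float = 1.0, max_delay: float = 60.0,
                 stable_after: float = 30.0) -> None:
        self.base_delay = base_delay      # seconds before the first retry
        self.max_delay = max_delay        # upper bound on any single delay
        self.stable_after = stable_after  # uptime needed to count as "stable"
        self.failures = 0

    def next_delay(self) -> float:
        """Delay to wait before the next connection attempt."""
        delay = min(self.base_delay * (2 ** self.failures), self.max_delay)
        self.failures += 1
        return delay

    def connection_ended(self, connected_for: float) -> None:
        """Reset the counter only if the connection stayed up long enough."""
        if connected_for >= self.stable_after:
            self.failures = 0


backoff = RetryBackoff()
print([backoff.next_delay() for _ in range(3)])
```

A connection that drops again after a few seconds keeps the delays growing, so a flapping server does not get hammered with immediate reconnects.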
cache
- set api response headers
- put "stale-while-revalidate" in Cache-Control w/ num seconds
- put "stale-if-error" in Cache-Control w/ num seconds
- set Expires or Cache-Control expires
- add Accept to vary response
- cache vary: might need to take bsky account privacy setting into account (unless this ends up being in query)
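For reference, the response headers described in the cache list above might end up looking something like this. It is a sketch only: the function name and the numbers of seconds are placeholders, and `max-age` is used here as the Cache-Control way of expressing expiry.

```python
from typing import Dict


def cache_headers(max_age: int, stale_while_revalidate: int,
                  stale_if_error: int) -> Dict[str, str]:
    """Build Cache-Control and Vary headers for an api response."""
    directives = [
        f"max-age={max_age}",
        f"stale-while-revalidate={stale_while_revalidate}",
        f"stale-if-error={stale_if_error}",
    ]
    return {
        "Cache-Control": ", ".join(directives),
        # Responses may differ by requested content type (json vs html).
        "Vary": "Accept",
    }


# Hypothetical durations, in seconds.
headers = cache_headers(max_age=10, stale_while_revalidate=30, stale_if_error=600)
print(headers["Cache-Control"])
```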
data fixes
- add rkey to linkers 🤦‍♀️
- don't remove deleted links from the reverse records -- null them out. this will keep things stable for paging.
- don't show deactivated accounts in link responses
- canonicalize handles to dids!
- links:
- [~] pull $type/type from object children of arrays (distinguish replies, quotes, etc)
- just $type to start
- rewrite the entire "path" stuff
- actually define the format (deal with in-band dots etc)
- could throw cid neighbour into the target. probably should? but it's a lot of high volume uncompressible bytes
- and it could be looked up from the linker's doc
- ^^ for now, look up from source doc to get cid. might revisit this later.