in a previous post, i introduced cryptogen as a high-level distributed storage system based on cryptographic capabilities. here i take the same ideas and look at them through a different lens. most of the same goals apply, namely we want:
- partially replicated, queryable datasets
- post-quantum crypto
- active data protection
the cryptogen project will be a place to experiment with epidemic broadcast trees [^ebt] and DHT-based replication in a cloud setting. the goal is a self-healing storage system that embraces the ephemeral nature of kubernetes-style cloud deployments while still handling persistence. creating a model that treats data as a self-interested organism is an important precursor for tracker. a wireless mesh is a much less stable environment than the cloud, but the cloud itself is much less stable than the mainframes it replaced. because of this new modality in software deployment, developers often find themselves fighting their tools when building cloud-based solutions. some of the most important components in any application simply were not designed to sit on top of ephemeral or elastic infrastructure.
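to make "epidemic" concrete, here is a minimal sketch of push-based gossip: each round, every node pushes its records to one random peer, and a record seeded at a single node spreads through the whole cluster in roughly O(log n) rounds. the `Node` class and `gossip_round` helper are illustrative toys, not cryptogen code:

```python
import random

# toy in-memory cluster for push-based epidemic replication;
# Node and gossip_round are hypothetical names, not cryptogen APIs
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.store = {}  # key -> value: this node's partial replica

    def push(self, peer):
        # push our records to a peer; the peer keeps anything it lacks
        for key, value in self.store.items():
            peer.store.setdefault(key, value)

def gossip_round(nodes):
    # each node pushes its records to one random other node
    for node in nodes:
        peer = random.choice([n for n in nodes if n is not node])
        node.push(peer)

random.seed(7)  # deterministic for the example
nodes = [Node(i) for i in range(8)]
nodes[0].store["record:1"] = b"payload"  # seed a single node

rounds = 0
while any("record:1" not in n.store for n in nodes):
    gossip_round(nodes)
    rounds += 1  # expected to converge in roughly O(log n) rounds
```

the appeal for ephemeral infrastructure is that there is no coordinator to lose: any surviving node can re-seed the rest.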
- adaptive query servers
popular database design has waffled a lot between SQL, NoSQL, NewSQL, and many other concepts. cryptogen is designed to give you a reliable key/value store with eventually-consistent replication based on delta-CRDTs. higher-level replication and querying happen in a client program that uses an internal query language based on datalog [^datalog-1] [^datalog-2] [^datalog-3] to generate query strategies and to inform replication decisions. consumers of this distributed system can use thin protocol adapters that translate different wire formats into datalog queries. similar to FoundationDB, cryptogen deconstructs the monolithic database into several independent components with a much narrower scope. in cryptogen, this enables us to use an epidemic model for data replication.
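as a sketch of what delta-CRDT replication buys you: a write produces a small delta rather than the whole state, and merging deltas in any order converges to the same result. this last-writer-wins map is one plausible delta-CRDT, not cryptogen's actual data model:

```python
# delta-based last-writer-wins map; DeltaLWWMap and its methods are
# hypothetical illustrations, not cryptogen's real storage layer
class DeltaLWWMap:
    def __init__(self):
        self.state = {}  # key -> (timestamp, value)

    def set(self, key, value, timestamp):
        # a local write returns the delta to gossip, not the full state
        delta = {key: (timestamp, value)}
        self.merge(delta)
        return delta

    def merge(self, delta):
        # join: keep the entry with the higher timestamp (last writer wins);
        # tuple comparison breaks timestamp ties deterministically by value
        for key, entry in delta.items():
            if key not in self.state or entry > self.state[key]:
                self.state[key] = entry

    def get(self, key):
        entry = self.state.get(key)
        return entry[1] if entry else None

a, b = DeltaLWWMap(), DeltaLWWMap()
d1 = a.set("user:1/name", "alice", timestamp=1)
d2 = b.set("user:1/name", "bob", timestamp=2)
a.merge(d2)  # exchange only the deltas, in either order
b.merge(d1)
assert a.get("user:1/name") == b.get("user:1/name") == "bob"
```

because merges commute, deltas can ride the same epidemic channels as everything else, with no ordering guarantees required from the network.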
cryptogen is designed to embed itself inside your application. even small ec2 instances include enough disk to be useful for storing and replicating encrypted shards. using this ephemeral storage introduces some challenges, but that is exactly the question i want to answer: how can a database maintain its integrity when the storage substrate is not reliable? additionally, what are the limits on self-healing for the EBT/DHT design, and what parameters influence that boundary? in a cloud deployment, cryptogen can report this information to the control surface to adjust the number of replicas that should be kept alive to maintain the current dataset. persistent servers can be used as an additional layer of protection against data loss, but the goal is to make these optional and non-critical for the normal operation of the system.
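as a back-of-the-envelope example of the kind of calculation that control surface could make: assuming independent node failures, the replica count needed to hit a survival target per repair interval falls out of a one-line formula. `replicas_needed` is a hypothetical helper, not part of any real control-surface API:

```python
import math

# a shard is lost only if every replica's node dies in the same
# repair interval: P(loss) = p_node_loss ** r,
# so we solve p_node_loss ** r <= 1 - target_survival for r
def replicas_needed(p_node_loss, target_survival):
    return math.ceil(math.log(1 - target_survival) / math.log(p_node_loss))

# e.g. nodes are recycled so aggressively that 20% vanish per repair
# interval, and we want 99.999% shard survival per interval
print(replicas_needed(0.20, 0.99999))  # -> 8
```

the independence assumption is optimistic (correlated failures like an AZ outage break it), which is one reason persistent servers remain useful as a backstop.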
- content-addressed buckets
- native embedded caching
- ipfs vs hbase [^ipfs-vs-hbase]
- ndn in hadoop [^ndn-in-hadoop]
- what the heck is the epidemic model for data replication? [^ebt]
- tunable consistency
- mesh consensus vs quorum consensus
data locality is critical for high-throughput workloads. cryptogen uses named data networking under the hood to efficiently replicate and cache database records at the network level. operating at this level gives us much more visibility into the state of the database.
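a tiny sketch of how content addressing could map onto NDN-style names: name each record by the hash of its bytes, so any in-network cache can serve it and the reader can verify integrity without trusting the cache. the `/cryptogen/records/` prefix is made up for illustration:

```python
import hashlib

# name a record by the sha-256 of its content; the prefix is hypothetical
def content_name(record: bytes) -> str:
    digest = hashlib.sha256(record).hexdigest()
    return f"/cryptogen/records/{digest}"

# a cached copy is valid iff re-hashing its bytes reproduces the name
def verify(name: str, record: bytes) -> bool:
    return content_name(record) == name

record = b'{"user": 1, "balance": 42}'
name = content_name(record)
assert verify(name, record)
assert not verify(name, b"tampered")
```

this is what makes network-level caching safe: the name itself is the integrity check, so locality can be exploited anywhere along the path.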
various overlays are possible, including websocket clients. additionally, cryptogen can be deployed on bare metal over plain ethernet with no ip infrastructure. other physical transports are also possible, including standard and ad-hoc wireless networks. theoretically, any device that can run a networking stack and a small applet could peer directly over any wireless or wired transport that can send data packets.
[^bep0030]: Merkle hash torrent extension
[^bep0032]: BitTorrent DHT Extensions for IPv6
[^bep0044]: Storing arbitrary data in the DHT
[^bep0046]: Updating Torrents via DHT Mutable Items
[^bep0050]: Pub/Sub Protocol
[^bep0052]: The BitTorrent Protocol Specification v2
[^datalog-1]: Chen, Weidong; Swift, Terrance; Warren, David S.
[^datalog-2]: Chen, Weidong; Warren, David S.
[^datalog-3]: Ceri, S.; Gottlob, G.; Tanca, L.
[^ipfs-vs-hbase]: Brisbane, Scott Ross
[^ndn-in-hadoop]: Newberry, Eric; Zhang, Beichuan
more future work includes decentralized, block-lattice-linked passports, but i have to finish writing my cryptogen paper before we can do that.
cryptogen has a lot of similarities to ssb, but i'm explicitly trying to avoid the whole "blockchain social network" thing. i think it's a good place for identity, trust, and naming, but immutable social networks aren't that great imo. i don't necessarily want my drunk shitposts or personal messages engraved in a permanent digital archive.
i've thought about it, but i'm not sure how applicable public block lattice regions are to the problem i'm trying to solve with cryptogen. the design is inspired by Web of Trust in some ways, but i've put a lot of consideration into protecting social metadata. enabling public regions undermines those protections in a lot of ways.
i haven't been completely talked out of thinking that cryptographic pet names are a good idea. that said, the best course of action is likely a hybrid approach to naming. in the vast majority of cases, a pet name is sufficient. however, there is some value in having names that are difficult or impossible to forge.
this is another instance where cryptogen’s cross-chain linking concepts are extremely useful. pet names can be published in native per-user blockchains and you can use the same interface to interact with Namecoin, for example, and link the transaction to your lattice region.
the big deal here is that you have a generalized interface for interacting with one or more blockchains. this means you can write cross-chain smart contracts and publish easily authenticated records with variable cost, depending on your needs. the goal is to provide a means to replace your PKI and DNS systems with a faster, more secure, and more trustworthy system that provides tools to ease your transition into the mesh.
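the hybrid naming idea can be sketched as a local pet-name table with a self-certifying fallback: pet names are cheap and friendly, while a name derived from a public key cannot be forged without forging the key. `PetNameResolver` and the `key:` prefix are illustrative, not real cryptogen APIs:

```python
import hashlib

class PetNameResolver:
    def __init__(self):
        self.petnames = {}  # human-chosen name -> public key bytes

    def register(self, petname, pubkey):
        self.petnames[petname] = pubkey

    @staticmethod
    def self_certifying(pubkey):
        # the name is derived from the key itself, so forging the
        # name means forging the key
        return "key:" + hashlib.sha256(pubkey).hexdigest()[:16]

    def resolve(self, name):
        if name in self.petnames:              # cheap, local, friendly
            return self.petnames[name]
        for key in self.petnames.values():     # unforgeable fallback
            if self.self_certifying(key) == name:
                return key
        return None

r = PetNameResolver()
alice_key = b"alice-ed25519-public-key"
r.register("alice", alice_key)
assert r.resolve("alice") == alice_key
assert r.resolve(PetNameResolver.self_certifying(alice_key)) == alice_key
```

a blockchain-published pet name would slot in between these two tiers: costlier than a local entry, but globally resolvable and harder to squat.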
one of the subsections of cryptogen that i'm really interested in exploring further is the anonymous gossip protocol. it was originally designed around using IPFS as the storage and replication system, plus some extra work to get anonymity.
i need to dig more into the routing protocols that make NDN work, but it would be interesting to see how a content-addressing scheme would work under named data. anonymous NDN is very promising and the overhead is minimal compared to Tor et al.