How we host a Debian repository on AWS S3
Pretty cool fact about our internal Debian repositories: they’re hosted entirely on AWS S3. Why is that cool? For two reasons:
- There are no active components to maintain.
- It’s super cheap. Like, really.
Now, you may wonder: how is that possible?! Isn’t a Debian repository something super complicated to host?
That is a great question. The answer lies in the technical intricacies of how Debian repositories work, which is what this post will walk through, ending with how we host repositories that serve hundreds of thousands of VMs and CI jobs every month for less than the price of a pack of Kinder Bueno.
To explain how a repository is organized, it is easiest to start with what you would typically put in your `/etc/apt/sources.list`:
deb http://deb.debian.org/debian/ bookworm main non-free-firmware
Let’s dissect this line and see what each part means:
- `deb`: nothing special here, this just indicates that the line specifies a remote repository.
- `http`: the protocol used to communicate with the repository. By the way, HTTPS is not necessary (although it is supported nowadays), as packages are later verified with GPG signatures. More on that below.
- `://deb.debian.org/debian/`: the URL where the repository lives.
- `bookworm`: the distribution. In the Debian world, each version’s name is called the “distribution”.
- `main`, `non-free-firmware`: those are components. In the Debian world, this is how packages are grouped. While we don’t particularly care for it, it allows Debian to separate the non-free packages, as well as probably other legacy reasons. (Feel free to enlighten us if you know more!)
By giving those 3 parameters (URL, distribution, component), a Debian system can download the list of packages, and install whatever is available.
Now, how do we get back to our S3 repos? Oh, yes. Here is how a Debian system downloads the list of packages, with a few assumptions:
- It’s an `http` repository, and this dumb Debian system only knows `curl`.
- We’re taking the easy path. Debian systems provide many features; we only care about the most straightforward ones.
- `$URL`, `$distribution`, `$component` are the 3 parameters mentioned above. `$architecture` is the CPU architecture, such as amd64.
- It does a `curl $URL/dists/$distribution/InRelease` and a `curl $URL/dists/$distribution/Release.gpg`, where:
  - The `InRelease` file has the list of `Packages` files (among others) and their hash sums, all inline-signed.
  - The `Release.gpg` file is a detached GPG signature of the `Release` file, the unsigned counterpart of `InRelease`.
- It then does a `curl $URL/dists/$distribution/$component/binary-$architecture/Packages.gz`, which has the list of packages.
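The fetch sequence above can be sketched in a few lines of shell. The parameter values are the examples from earlier; the actual downloads are left commented out so the sketch runs offline:

```shell
# Sketch of the metadata fetch described above. The parameter values
# are examples; any Debian repository works the same way.
URL=http://deb.debian.org/debian
distribution=bookworm
component=main
architecture=amd64

release_url="$URL/dists/$distribution/InRelease"
signature_url="$URL/dists/$distribution/Release.gpg"
packages_url="$URL/dists/$distribution/$component/binary-$architecture/Packages.gz"

echo "$packages_url"
# curl -sO "$release_url" && curl -sO "$signature_url" && curl -sO "$packages_url"
```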
The `Packages` file looks like this:
Package: 0ad
Version: 0.0.26-3
Installed-Size: 28591
Maintainer: Debian Games Team <pkg-games-devel@lists.alioth.debian.org>
Architecture: amd64
Depends: [ ... skipped for brevity ... ]
Description: Real-time strategy game of ancient warfare
Filename: pool/main/0/0ad/0ad_0.0.26-3_amd64.deb
Size: 7891488
MD5sum: 4d471183a39a3a11d00cd35bf9f6803d
SHA256: 3a2118df47bf3f04285649f0455c2fc6fe2dc7f0b237073038aa00af41f0d5f2
(That’s the first package in https://deb.debian.org/debian/dists/bookworm/main/binary-amd64/Packages.gz)
You will notice several things in there:
- With the hash sums, you can actually verify that the package you end up downloading is valid thanks to the top-level signature that was previously downloaded. Transitive trust matters!
- There is a `Filename` property: it points to the actual `.deb` file that you can download and run `dpkg -i` on.
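To make that transitive trust concrete, here is a self-contained sketch of the verification step. Everything in it is fabricated so the example runs offline: in reality, the expected hash comes from the downloaded `Packages` file, whose own hash is in turn listed in the signed `InRelease` file.

```shell
# Fabricate a stand-in .deb and its Packages stanza (illustrative only).
printf 'not a real deb' > demo_1.0_amd64.deb
sha=$(sha256sum demo_1.0_amd64.deb | cut -d' ' -f1)
cat > Packages <<EOF
Package: demo
Filename: pool/main/d/demo/demo_1.0_amd64.deb
SHA256: $sha
EOF

# The actual check: compare the stanza's SHA256 with the file's hash.
expected=$(awk '/^SHA256:/ {print $2}' Packages)
actual=$(sha256sum demo_1.0_amd64.deb | cut -d' ' -f1)
[ "$expected" = "$actual" ] && echo "package checksum verified"
```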
I think you can see where I’m going with this. And yes, this is all static. It is easy to simply replicate that hierarchy to AWS S3, and poof, it works. That’s what our CI jobs do: they replicate this folder hierarchy on local disk:
dists/
  bookworm/
    Release.gpg
    InRelease
    main/
      binary-amd64/
        Packages.gz
pool/
  foo.deb
  bar.deb
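A CI job could produce that hierarchy along these lines. This is only a sketch using coreutils: a real job would likely build the `Packages` file with `dpkg-scanpackages` or `apt-ftparchive` and sign the release files with GPG, and every name here is a placeholder.

```shell
# Build the folder hierarchy shown above (placeholder content throughout).
mkdir -p repo/dists/bookworm/main/binary-amd64 repo/pool
printf 'placeholder' > repo/pool/foo.deb     # stand-in for a real .deb

# Write a minimal Packages entry for it, gzipped as apt expects.
size=$(wc -c < repo/pool/foo.deb)
sha=$(sha256sum repo/pool/foo.deb | cut -d' ' -f1)
printf 'Package: foo\nVersion: 1.0\nArchitecture: amd64\nFilename: pool/foo.deb\nSize: %s\nSHA256: %s\n' \
  "$size" "$sha" | gzip > repo/dists/bookworm/main/binary-amd64/Packages.gz

# Next steps (not shown): generate dists/bookworm/Release, sign it into
# InRelease and Release.gpg, then sync the repo/ directory to S3.
```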
And then they run `aws s3 sync . s3://...` (essentially an “rsync”), and magic! We’re done! 🪄
… almost.
Well, yeah, these are internal Debian repositories. We don’t really want them to be open to the world. How do we do authentication, then? We cannot just have public S3 buckets, can we?
On a related note, you might have found it a bit weird how `http` was separated out when we explained what the `sources.list` line was made up of. This was intentional: Debian systems support more than one protocol for the remote. Of course, `http` and `https` are installed by default. But you know what’s even better? You can add support for new protocols.
By default, all our Debian systems install `apt-transport-s3`. This package allows us to have this `sources.list`:
deb s3://internal-debian-repository.platform.sh bookworm main
Custom protocol, with our S3 bucket URL, and the package allows authentication details to be provided as part of the URL. (As in, `s3://<access key ID>:<secret key>@<bucket name>`.)
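For completeness: embedding the secret key in the URL means it ends up in a world-readable `sources.list`, and if I recall correctly, `apt-transport-s3` can also read credentials from a configuration file instead, something along these lines (keys are placeholders; check the package’s documentation for your version):

```
# /etc/apt/s3auth.conf (illustrative)
AccessKeyId = <access key ID>
SecretAccessKey = <secret key>
```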
And that’s it! With these simple tricks, or really, some knowledge of how Debian repositories work, it was straightforward to host our own repositories on S3, which gives us a lot of flexibility in how we organize them. As an example, we define a new Debian component per tag of a given repository, which lets us upgrade our systems deterministically.
I hope that was helpful and that you learned something!