How we host a Debian repository on AWS S3
Pretty cool fact about our internal Debian repositories: they’re hosted entirely on AWS S3. Why is that cool? For two reasons:
- There are no active components to maintain.
- It’s super cheap. Like, really.
Now, you may wonder: how is that possible?! Isn’t a Debian repository something super complicated to host?
That is a great question. The answer lies in the technical intricacies of how Debian repositories work, which is what this post will walk through, ending with how we host repositories that serve hundreds of thousands of VMs and CI jobs every month for less than the price of a pack of Kinder Bueno.
To explain how a repository is organized, it is easiest to start with what you would typically put in your `/etc/apt/sources.list`:
deb http://deb.debian.org/debian/ bookworm main non-free-firmware
Let’s dissect this line and see what each part means:
- `deb`: nothing special here, this just indicates that the line specifies a remote repository.
- `http`: the protocol used to communicate with the repository. By the way, HTTPS is not necessary (although it is supported nowadays), as packages are later verified with GPG signatures. More on that below.
- `://deb.debian.org/debian/`: the URL where the repository lives.
- `bookworm`: the distribution. In the Debian world, each version’s name is called the “distribution”.
- `main`, `non-free-firmware`: those are components. In the Debian world, this is how packages are grouped. While we don’t particularly care for it, it allows Debian to separate the non-free packages, as well as probably other legacy reasons. (Feel free to enlighten us if you know more!)
By giving those 3 parameters (URL, distribution, component), a Debian system can download the list of packages, and install whatever is available.
Now, how do we get back to our S3 repos? Oh, yes. Here is how a Debian system downloads the list of packages, with a few assumptions:
- It’s an `http` repository, and this dumb Debian system only knows `curl`.
- We’re taking the easy path. Debian systems provide many features; we only care about the most straightforward ones.
- `$URL`, `$distribution`, `$component` are the 3 parameters mentioned above. `$architecture` is the CPU architecture, such as amd64.
- It does a `curl $URL/dists/$distribution/InRelease` and a `curl $URL/dists/$distribution/Release.gpg`, where:
  - The `InRelease` file has the list of `Packages` files (among others) and their hash sums, all inline-signed.
  - The `Release.gpg` file is a detached GPG signature of the `Release` file, the unsigned counterpart of `InRelease`.
- It then does a `curl $URL/dists/$distribution/$component/binary-$architecture/Packages.gz`, which has the list of packages.
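The fetch sequence above can be sketched in a few lines of shell. The parameter values are the examples from earlier; the actual downloads are left commented out so the sketch runs offline:

```shell
# Sketch of the metadata fetch described above. The parameter values
# are examples; any Debian repository works the same way.
URL=http://deb.debian.org/debian
distribution=bookworm
component=main
architecture=amd64

release_url="$URL/dists/$distribution/InRelease"
signature_url="$URL/dists/$distribution/Release.gpg"
packages_url="$URL/dists/$distribution/$component/binary-$architecture/Packages.gz"

echo "$packages_url"
# curl -sO "$release_url" && curl -sO "$signature_url" && curl -sO "$packages_url"
```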
The `Packages` file looks like this:
Package: 0ad
Version: 0.0.26-3
Installed-Size: 28591
Maintainer: Debian Games Team <pkg-games-devel@lists.alioth.debian.org>
Architecture: amd64
Depends: [ ... skipped for brevity ... ]
Description: Real-time strategy game of ancient warfare
Filename: pool/main/0/0ad/0ad_0.0.26-3_amd64.deb
Size: 7891488
MD5sum: 4d471183a39a3a11d00cd35bf9f6803d
SHA256: 3a2118df47bf3f04285649f0455c2fc6fe2dc7f0b237073038aa00af41f0d5f2
(That’s the first package in https://deb.debian.org/debian/dists/bookworm/main/binary-amd64/Packages.gz)
You will notice several things in there:
- With the hash sums, you can actually verify that the package you end up downloading is valid thanks to the top-level signature that was previously downloaded. Transitive trust matters!
- There is a `Filename` property: it points to the actual `.deb` file that you can download and run `dpkg -i` on.
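To make that transitive trust concrete, here is a self-contained sketch of the verification step. Everything in it is fabricated so the example runs offline: in reality, the expected hash comes from the downloaded `Packages` file, whose own hash is in turn listed in the signed `InRelease` file.

```shell
# Fabricate a stand-in .deb and its Packages stanza (illustrative only).
printf 'not a real deb' > demo_1.0_amd64.deb
sha=$(sha256sum demo_1.0_amd64.deb | cut -d' ' -f1)
cat > Packages <<EOF
Package: demo
Filename: pool/main/d/demo/demo_1.0_amd64.deb
SHA256: $sha
EOF

# The actual check: compare the stanza's SHA256 with the file's hash.
expected=$(awk '/^SHA256:/ {print $2}' Packages)
actual=$(sha256sum demo_1.0_amd64.deb | cut -d' ' -f1)
[ "$expected" = "$actual" ] && echo "package checksum verified"
```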
I think you can see where I’m going with this. And yes, this is all static. It is easy to simply replicate that hierarchy to AWS S3, and poof, it works. That’s what our CI jobs do: they replicate this folder hierarchy on local disk:
dists/
  bookworm/
    Release.gpg
    InRelease
    main/
      binary-amd64/
        Packages.gz
pool/
  foo.deb
  bar.deb
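A CI job could produce that hierarchy along these lines. This is only a sketch using coreutils: a real job would likely build the `Packages` file with `dpkg-scanpackages` or `apt-ftparchive` and sign the release files with GPG, and every name here is a placeholder.

```shell
# Build the folder hierarchy shown above (placeholder content throughout).
mkdir -p repo/dists/bookworm/main/binary-amd64 repo/pool
printf 'placeholder' > repo/pool/foo.deb     # stand-in for a real .deb

# Write a minimal Packages entry for it, gzipped as apt expects.
size=$(wc -c < repo/pool/foo.deb)
sha=$(sha256sum repo/pool/foo.deb | cut -d' ' -f1)
printf 'Package: foo\nVersion: 1.0\nArchitecture: amd64\nFilename: pool/foo.deb\nSize: %s\nSHA256: %s\n' \
  "$size" "$sha" | gzip > repo/dists/bookworm/main/binary-amd64/Packages.gz

# Next steps (not shown): generate dists/bookworm/Release, sign it into
# InRelease and Release.gpg, then sync the repo/ directory to S3.
```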
And then they run `aws s3 sync . s3://...` (essentially an “rsync”), and magic! We’re done! 🪄
… almost.
Well, yeah, these are internal Debian repositories. We don’t really want them to be open to the world. How do we do authentication, then? We cannot just have public S3 buckets, can we?
On a related note, you might have found it a bit weird how `http` was separated out when we explained what the `sources.list` line was made up of. This was intentional: Debian systems support more than one protocol for the remote. Of course, `http` and `https` are installed by default. But you know what’s even better? You can add support for new protocols.
By default, all our Debian systems install `apt-transport-s3`. This package allows us to have this `sources.list`:
deb s3://internal-debian-repository.platform.sh bookworm main
Custom protocol, with our S3 bucket URL, and the package allows authentication details to be provided as part of the URL. (As in, `s3://<access key ID>:<secret key>@<bucket name>`.)
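For completeness: embedding the secret key in the URL means it ends up in a world-readable `sources.list`, and if I recall correctly, `apt-transport-s3` can also read credentials from a configuration file instead, something along these lines (keys are placeholders; check the package’s documentation for your version):

```
# /etc/apt/s3auth.conf (illustrative)
AccessKeyId = <access key ID>
SecretAccessKey = <secret key>
```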
And that’s it! With these simple tricks, or really, some knowledge of how Debian repositories work, it was straightforward to host our own repositories on S3, which gives us a lot of flexibility in how we organize them. As an example, we define a new Debian component per tag of a given repository, which lets us upgrade our systems deterministically.
I hope that was helpful and that you learned something!