Using S3 as a Container Registry
Adolfo Ochagavía discusses using Amazon S3 as a container registry, noting its speed advantages over ECR. S3's parallel layer uploads enhance performance, despite lacking standard registry features. The unconventional approach offers optimization potential.
Adolfo Ochagavía shares insights on using Amazon S3 as a container registry, highlighting its surprising speed advantages over traditional registries like ECR. By leveraging S3's ability to upload layers in parallel, he achieves significant performance gains compared to ECR's sequential upload process. While S3 isn't a conventional registry, it can serve as one because pulling an image boils down to plain HTTP requests. The experiment showcases S3's potential as a container registry, although caution is advised due to its experimental nature and the absence of standard registry features like image validation and security scans. Despite these limitations, the approach opens up possibilities for optimizing container image hosting. The article concludes with a playful reference to the Docker logo and speculates on potential trends in public image hosting. The author invites further exploration and feedback on this unconventional use of S3 for container management.
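A minimal sketch of what that layout can look like, assuming a hypothetical bucket, layer tarballs on disk, and a pre-built OCI manifest; this illustrates the idea rather than reproducing the author's code:

```python
# Minimal sketch: object keys mirror the paths that `docker pull` requests,
# so S3 (or any static file host) can answer them directly.
# Bucket name, repo name, and file names below are made up.
import hashlib
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "my-registry-bucket"   # hypothetical bucket
REPO = "myapp"                  # hypothetical repository name


def upload_blob(path: str) -> str:
    """Upload one layer (or config) blob under its content digest."""
    data = open(path, "rb").read()
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    s3.put_object(
        Bucket=BUCKET,
        Key=f"v2/{REPO}/blobs/{digest}",
        Body=data,
        ContentType="application/octet-stream",
    )
    return digest


# Unlike the spec's chunked push (sequential within each layer), plain S3 PUTs
# can run fully in parallel, which is where the speedup comes from.
with ThreadPoolExecutor() as pool:
    digests = list(pool.map(upload_blob, ["layer0.tar.gz", "layer1.tar.gz", "config.json"]))

# The manifest (which must reference the digests above) goes last, under the
# path a client resolves a tag against.
s3.put_object(
    Bucket=BUCKET,
    Key=f"v2/{REPO}/manifests/latest",
    Body=open("manifest.json", "rb").read(),
    ContentType="application/vnd.oci.image.manifest.v1+json",
)
```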
Related
From Cloud Chaos to FreeBSD Efficiency
A client shifted from expensive Kubernetes setups on AWS and GCP to cost-effective FreeBSD jails and VMs, improving control, cost savings, and performance. Real-world tests favored FreeBSD over cloud solutions, emphasizing efficient resource management.
Simple GitHub Actions Techniques
Denis Palnitsky's Medium article explores advanced GitHub Actions techniques like caching, reusable workflows, self-hosted runners, third-party managed runners, GitHub Container Registry, and local workflow debugging with Act. These strategies aim to enhance workflow efficiency and cost-effectiveness.
Show HN: S3HyperSync – Faster S3 sync tool – iterating with up to 100k files/s
S3HyperSync is a GitHub tool for efficient file synchronization between S3-compatible services. It optimizes performance, memory, and costs, ideal for large backups. Features fast speeds, UUID Booster, and installation via JAR file or sbt assembly. Visit GitHub for details.
DuckDB Meets Postgres
Organizations shift historical Postgres data to S3 with Apache Iceberg, enhancing query capabilities. ParadeDB integrates Iceberg with S3 and Google Cloud Storage, replacing DataFusion with DuckDB for improved analytics in pg_lakehouse.
AWS Secrets Manager Agent
The GitHub repository provides an HTTP service for simplifying access to AWS Secrets Manager in various environments. It offers guidance on building, deploying, and using the Secrets Manager Agent, including configuration, logging, and security details.
> According to the specification, a layer push must happen sequentially: even if you upload the layer in chunks, each chunk needs to finish uploading before you can move on to the next one.
As far as I've tested with DockerHub and GHCR, chunked upload is broken anyway, and clients upload each blob/layer as a whole. The spec also promotes `Content-Range` value formats that do not match the RFC 7233 format.
(That said, there's parallelism on the level of blobs, just not per blob)
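For reference, the per-layer chunked push the spec describes looks roughly like this; the registry URL and repository name are made up, and real registries differ in details such as whether `Location` is absolute:

```python
# Rough sketch of the spec's chunked layer push: POST to open an upload
# session, sequential PATCHes for the chunks, final PUT with the digest.
import hashlib
from urllib.parse import urljoin

import requests

REGISTRY = "https://registry.example.com"  # hypothetical
REPO = "myapp"                             # hypothetical


def push_layer(data: bytes, chunk_size: int = 5 * 1024 * 1024) -> None:
    digest = "sha256:" + hashlib.sha256(data).hexdigest()

    # 1. Open an upload session; the registry returns a Location to PATCH
    #    (possibly relative, hence urljoin).
    r = requests.post(f"{REGISTRY}/v2/{REPO}/blobs/uploads/")
    r.raise_for_status()
    location = urljoin(REGISTRY, r.headers["Location"])

    # 2. Upload chunks *sequentially*: each PATCH depends on the end offset of
    #    the previous one, so chunks of a single layer cannot go in parallel.
    #    Note the spec's "<start>-<end>" Content-Range, not the RFC 7233 form.
    offset = 0
    while offset < len(data):
        chunk = data[offset:offset + chunk_size]
        r = requests.patch(
            location,
            data=chunk,
            headers={
                "Content-Type": "application/octet-stream",
                "Content-Range": f"{offset}-{offset + len(chunk) - 1}",
            },
        )
        r.raise_for_status()
        location = urljoin(REGISTRY, r.headers.get("Location", location))
        offset += len(chunk)

    # 3. Close the session with the digest of the whole layer.
    requests.put(location, params={"digest": digest}).raise_for_status()
```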
Another gripe of mine is that they missed the opportunity to standardize pagination of listing tags, because they accidentally deleted some text from the standard [1]. Now different registries roll their own.
[1] https://github.com/opencontainers/distribution-spec/issues/4...
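As an illustration of that gripe, here is a rough sketch of the common `n`/`last` pagination scheme; registries implement it with varying fidelity, which is exactly the problem:

```python
# Hypothetical client-side pagination over /v2/<name>/tags/list using the
# `n` and `last` query parameters; some registries instead (or also) return
# an RFC 5988 Link header, which this sketch ignores.
import requests


def list_tags(registry: str, repo: str, page_size: int = 100) -> list[str]:
    tags: list[str] = []
    last = None
    while True:
        params = {"n": page_size}
        if last:
            params["last"] = last
        r = requests.get(f"{registry}/v2/{repo}/tags/list", params=params)
        r.raise_for_status()
        page = r.json().get("tags") or []
        tags.extend(page)
        if len(page) < page_size:
            return tags
        last = page[-1]
```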
Anyone tried it?
Edit: to clarify, I'm talking about sequentially pushing a _single_ layer's contents. You can, of course, push multiple layers in parallel.
Personally, I just use Nexus because it works well enough (and supports everything from OCI images to apt packages and stuff like custom Maven, NuGet, npm repos etc.); however, the configuration and resource usage are both a bit annoying, especially when it comes to cleanup policies: https://www.sonatype.com/products/sonatype-nexus-repository
That said:
> More specifically, I logged the requests issued by docker pull and saw that they are “just” a bunch of HEAD and GET requests.
this is immensely nice and I wish more tech out there made common sense decisions like this, just using what has worked for a long time and not overcomplicating.
I am a bit surprised that there aren't more simple container repositories out there (especially with auth and cleanup support), since Nexus and Harbor are both a bit complex in practice.
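To make the point concrete, a pull really is just a few requests like these (hypothetical registry and repo; auth and media-type negotiation glossed over):

```python
# What `docker pull` boils down to: GET the manifest by tag, then GET each
# blob by its content-addressed digest. Any host answering these paths works.
import requests

REGISTRY = "https://registry.example.com"   # hypothetical
REPO = "myapp"                              # hypothetical

# Resolve the tag to a manifest.
manifest = requests.get(
    f"{REGISTRY}/v2/{REPO}/manifests/latest",
    headers={"Accept": "application/vnd.oci.image.manifest.v1+json"},
).json()

# Fetch the config and every layer blob referenced by the manifest.
for descriptor in [manifest["config"], *manifest["layers"]]:
    blob = requests.get(
        f"{REGISTRY}/v2/{REPO}/blobs/{descriptor['digest']}"
    ).content
```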
https://distribution.github.io/distribution/storage-drivers/...
Related ECR APIs:
- InitiateLayerUpload API: called at the beginning of the upload of each image layer
- UploadLayerPart API: called for each layer chunk (up to 20 MB)
- PutImage API: called after the layers are uploaded, to push the image manifest, which references all image layers
The only weird thing seems to be that you have to upload layer chunks in base64 encoding, which increases the data size by ~33%.
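For comparison, a bare-bones sketch of that ECR push flow with boto3 (the repository name is made up; boto3 accepts raw bytes here and, as far as I know, does the base64 encoding of `layerPartBlob` under the hood, which is where the extra bytes on the wire come from):

```python
# Rough sketch of the ECR layer push APIs via boto3. Error handling and
# manifest construction are elided; the repository name is hypothetical.
import hashlib

import boto3

ecr = boto3.client("ecr")
REPO = "myapp"  # hypothetical repository


def push_layer(data: bytes) -> str:
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    upload = ecr.initiate_layer_upload(repositoryName=REPO)
    upload_id, part_size = upload["uploadId"], upload["partSize"]

    # Each part is sent with UploadLayerPart; parts of one layer go in order.
    for offset in range(0, len(data), part_size):
        part = data[offset:offset + part_size]
        ecr.upload_layer_part(
            repositoryName=REPO,
            uploadId=upload_id,
            partFirstByte=offset,
            partLastByte=offset + len(part) - 1,
            layerPartBlob=part,
        )

    ecr.complete_layer_upload(
        repositoryName=REPO, uploadId=upload_id, layerDigests=[digest]
    )
    return digest


# After all layers are uploaded, the manifest referencing them is pushed:
# ecr.put_image(repositoryName=REPO, imageManifest=manifest_json, imageTag="latest")
```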
I do wonder, though, how you would deal with the Docker-Content-Digest header. While not required, it is suggested that responses include it, as many clients expect it and will reject layers without the header.
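For what it's worth, the header's value is just the SHA-256 of the exact bytes served, so computing it is trivial; actually emitting it from S3 would need something in front of the bucket (a CDN or edge function, not shown here):

```python
# Docker-Content-Digest is "sha256:" + the hex SHA-256 of the exact bytes
# returned for a manifest or blob; clients typically recompute it to verify.
import hashlib


def content_digest(body: bytes) -> str:
    return "sha256:" + hashlib.sha256(body).hexdigest()

# e.g. a client-side check against a registry response:
# assert content_digest(resp.content) == resp.headers["Docker-Content-Digest"]
```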
Another thing to consider is that you will miss out on some features from the OCI 1.1 spec, like the referrers API, as that would be a bit tricky to implement.
Awesome. Developer experience is so much better when CI doesn't take ages. Every little bit counts.
I don't see any reason why ECR couldn't support parallel uploads as an optimization. Provide an alternative to `docker push` that doesn't conform to the spec, for those who care about speed.
My interest was mainly from a hardening standpoint. The base idea was that the release system, through IAM permissions, would be the only system with any write access to the underlying S3 bucket. All the public / internet-facing components could then be limited to read-only access as part of the hardening.
This would of course be in addition to signing the images, but I don't think many of the customers at the time knew anything about or configured any of the signature verification mechanisms.
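Roughly the shape of that split, as a hypothetical bucket policy (bucket name, account ID, and role ARNs are all made up):

```python
# Sketch of the read/write split: only the release role may write, the
# internet-facing role may only read. Apply via put_bucket_policy.
import json

import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # release pipeline: the only principal allowed to write objects
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/release-bot"},
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::my-registry-bucket/*",
        },
        {   # public-facing pull path: read-only
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/registry-frontend"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-registry-bucket/*",
        },
    ],
}

boto3.client("s3").put_bucket_policy(
    Bucket="my-registry-bucket", Policy=json.dumps(policy)
)
```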
There is a real use case for this in some high-security sectors. I can't put complete info here for security reasons, but let me know if you are interested.
The author is missing something huge: ECR does a security scan on upload, too.
It would be nice if a Kubernetes distro took a page out of the "serverless" playbook and just embedded a registry. Or maybe I should just use GHCR.