January 26 - February 1
I built a Docker test harness. It speeds up development because I don’t need to deploy to Autograder to test the autograding script. I still need to publish as a .zip because we don’t have a class container image registry.
Autograder has a custom test harness. I’m trying to hook into it to distinguish between test failures from assertions and test failures from the test harness being wrong. I want students to get a message “TestHarness failed: Contact Ed” so they don’t waste time thinking they did something wrong, and so I get notified quickly that I need to update the autograder script. It’s tricky with Gradescope’s setup, though.
I made a choice with the autograder harness that if a non-AssertionError occurs (that is, something is wrong with how the test is coded rather than with what the student implemented) it will display a message and a stack trace to the student, prompting them to share it with a TA (me) on Ed. This is the best I can do for Cloud Assignment 1, because we don’t have any infrastructure for storing and responding to errors. I can build that later.
Cloud assignment 2 will consist of the implementation of two features, and the expansion of system architecture to accommodate those features.
The motivation for the concepts I’m introducing in Cloud Assignment 2 is summarized by the following blurb: “In Cloud Assignment 1, bird.ai was building their MVP. Now, they’ve raised a seed round and are ready to get more customers.”
The first feature is a history of submitted images and the resulting classification. This introduces a need to store user-provided photos and keep track of submissions per-user. Disk space on a server is limited (especially on the AWS free tier), and students will be walked through why we transition data off disk into object storage, in this case, S3. By carefully managing on-disk data usage, we are able to scale the history feature to many more users and image submissions without greatly expanding the storage capabilities of the servers themselves, saving money and improving service resilience.
Now that images are stored in AWS S3, students will have to track submission metadata and content in order to recover and display that information to the user through a user interface. To that end, they will have to implement a simple Django data model that will power the history feature experience, introducing them to the “Model” in “Model-View-Controller”, highlighting how system architecture decisions must be tied to application-level implementations.
It’s not enough that images are stored in S3. Content delivery networks, or CDNs, are an important user experience and cost-saving measure. Users are sensitive to latency, and delivering static images is not something either Python or S3 is exceptional at. Any image request to either of those resources has to travel to the origin server in a specific AWS Region, so latency grows for users farther away from that region. Additionally, pulling data out of S3 is expensive in terms of data transfer. We will use these motivations to introduce AWS CloudFront as a CDN in front of the S3 bucket, reducing latency by pushing static assets to the edge of the AWS global network. CloudFront offers both lower latency and lower cost, reducing pressure on origin servers and improving key metrics of the user experience. It achieves lower latency by taking advantage of 100+ “Points of Presence”, or small regional data centers, around the globe, running software that is optimized for serving media. CloudFront is 10x lower cost than S3 due to different trade-offs made in its architecture and to encourage customers to reduce load on S3 servers.
The second feature I would like students to implement is geo-location related to image submissions (Note: this may get pushed to the third assignment depending on assignment 2’s length). Images have EXIF metadata that, among other parameters, records the longitude and latitude where a photo was taken. Students will be asked to strip the metadata from each submitted photo and store it in the database, using that data to populate a map showing where every photo was taken. This feature is meant to demonstrate to students how much information they are sharing with cloud software vendors when they use their services. It is also possible to grab the IP of the user at the time of submission and perform IP-based geo-location, but it can be difficult to find an IP-to-location database with a permissive licensing model. Students will learn how to query PostgreSQL using its geospatial capabilities to power the feature, and a mapping service will be hosted, introducing multi-node service architectures. Again, this is a stretch goal and will likely be introduced in Cloud Assignment 3.
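EXIF records each GPS coordinate as degrees, minutes, and seconds plus a hemisphere reference (N/S for latitude, E/W for longitude), while a map or a geospatial column typically wants signed decimal degrees. A small sketch of that conversion follows; the function name dms_to_decimal is my own, and reading the tags out of the image file is left to whichever EXIF library the assignment ends up using.

```python
from typing import Tuple


def dms_to_decimal(dms: Tuple[float, float, float], ref: str) -> float:
    """Convert an EXIF-style (degrees, minutes, seconds) triple to decimal degrees.

    `ref` is the hemisphere reference stored alongside the coordinate:
    "N" or "S" for latitude, "E" or "W" for longitude.
    """
    degrees, minutes, seconds = dms
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation.
    if ref.upper() in ("S", "W"):
        value = -value
    return value


# Example: 40 degrees 30 minutes north
print(dms_to_decimal((40, 30, 0), "N"))  # 40.5
```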
The final application architecture will consist of an AWS RDS instance running PostgreSQL instead of SQLite. It introduces different types of relational databases, and gives an opportunity to contrast SQLite’s embedded, in-process database model with PostgreSQL’s client-server model. PostgreSQL is commonly used in industry, and has a vast open-source ecosystem, which students can take advantage of for self-learning. Students will deploy their updated application onto an EC2 instance, and configure the application to connect to the AWS-managed database, giving them valuable DevOps skills. Students will also have to write a Dockerfile for their Django application, so they codify concepts and commands learned in Cloud Assignment 1 and put them to work containerizing the application.
This is the nuts and bolts of software development, and students’ mental model of software will be challenged by pushing this assignment to production.
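As a rough illustration of what “configure the application to connect to the AWS-managed database” might look like in a Django settings module, here is a sketch that builds the DATABASES setting from environment variables. The variable names (DB_HOST, DB_PASSWORD, and so on) and the default database name are assumptions for illustration, not part of the assignment.

```python
import os
from typing import Mapping


def database_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Build a Django DATABASES dict for PostgreSQL from environment variables.

    The variable names are hypothetical. DB_HOST would hold the RDS endpoint.
    """
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": env.get("DB_NAME", "birdai"),
            "USER": env.get("DB_USER", "birdai"),
            # Required values raise KeyError early instead of failing at connect time.
            "PASSWORD": env["DB_PASSWORD"],
            "HOST": env["DB_HOST"],
            "PORT": env.get("DB_PORT", "5432"),
        }
    }


# In settings.py this would be used as: DATABASES = database_settings()
```

Keeping the endpoint and password out of the source file also means the same container image can point at a different database without being rebuilt.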
It’s important to touch on what was not introduced in this assignment. Ansible has been omitted, so students gain more experience manually configuring servers. Infrastructure-as-Code has not made an entrance either; students will gain more familiarity with the AWS console by setting up AWS CloudFront and S3 manually in this assignment. Orchestration has not been introduced yet either. We are slowly ramping up the complexity of our application, mirroring how software grows in the real world: organically, through communication with users, with changes motivated by technical or product needs.
There is an opportunity to discuss virtual networking in Linux through how Docker sets up container networking, which will be mentioned in the assignment as well. Students will be encouraged to explore the network configuration on the EC2 instance, and to answer questions about what changes the Docker daemon makes when running containers. Virtual storage will also be mentioned, in the context of Docker Volumes, EBS, and S3.
Cloud Assignment 3 coincides with two important chapters in the textbook: Automation and Orchestration. These practices are important when scaling a software service beyond a single server into multiple supporting resources.
This assignment will introduce an N-tier web architecture for the bird.ai SaaS. All previous features will continue to be supported, but focus will be placed on a system architecture meant for scaling. Part of the deliverables for the assignment will be producing an architecture that can handle a sustained load test from the autograder. Another focus will be placed on the instance types chosen for each component of this service. Students are presented with a wide array of choices from EC2, so how will they make decisions about what to use? Connecting how a piece of software operates to the hardware that best supports that operation is an important principle to be aware of, and will help students become informed consumers of cloud services.
Let’s discuss what an N-tier architecture looks like. First, web application servers are meant to be scaled horizontally. They are typically stateless, and handle a variety of tasks. They are best run on general-purpose instances. Multiple application servers must be load balanced, so a proxy server will be introduced, requiring a memory-optimized instance to support caching and a higher number of concurrent connections. Finally, application servers will connect to a shared database, which should be run on hardware that is IO-optimized and whose architecture has been well-tuned for that type of workload. All these pieces expand a student’s perspective from an application as a single process on a single machine to a suite of processes across multiple machines that work together to achieve more than they could separately. It also forces them to inspect and understand the software they are running, so they can choose the right platform to run it on.
Having more than one server introduces challenges that will motivate automation-based practices to initialize servers and deploy applications. This assignment will introduce Ansible, and students will have to write playbooks for common tasks. The assignment will also introduce Terraform, a common tool in industry, which will help them manage the complexity of this assignment with less risk of incurring costs due to unused resources remaining provisioned. The load test performed against their architecture comes with risks, such as crashing EC2 instances or web servers. Their automation scripts will allow them to quickly get back to where they were, and the threat of a crash is additional motivation to have automation already defined for a software service.
I had initially thought this assignment would be a good opportunity to introduce Kubernetes. I now believe that will wait until microservices have been introduced, during the time period for Cloud Assignment 4. It will already be a lot to introduce multi-node architectures and different automation practices to students. Kubernetes is such a big topic (auto-scaling, control planes, scheduling, software-defined networking, etc.) that it merits a more focused assignment.
I’ve updated the course calendar to account for Spring Break and to give a week break to students after each midterm. The projected course calendar is now on this project’s home page.
I’ve developed a script that converts assignment assets to a single file that is easy to publish to Brightspace. It inlines styles, fonts, and images into the HTML document itself, making it completely self-contained for rendering by a web browser.
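The core trick behind a self-contained HTML file is the data URI: the bytes of an asset are base64-encoded and placed directly in the attribute that used to hold a path. The sketch below shows the idea for image src attributes only. It is not the actual publishing script; the function names are hypothetical, and a real implementation would more likely use an HTML parser than a regular expression.

```python
import base64
import re
from typing import Mapping, Tuple


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a data URI that a browser can render inline."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Matches double-quoted src attributes only; enough for a sketch.
_SRC_ATTR = re.compile(r'src="([^"]+)"')


def inline_images(html: str, assets: Mapping[str, Tuple[bytes, str]]) -> str:
    """Replace each known src path with a data URI.

    `assets` maps the path as written in the HTML to (file bytes, MIME type).
    Paths that are not in `assets` are left untouched.
    """
    def replace(match: "re.Match[str]") -> str:
        path = match.group(1)
        if path not in assets:
            return match.group(0)
        data, mime_type = assets[path]
        return f'src="{to_data_uri(data, mime_type)}"'

    return _SRC_ATTR.sub(replace, html)
```

Base64 makes the encoded asset about a third larger than the original file, which is usually an acceptable trade for a single document that can be uploaded as one file.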
I’ve also added Grace to my repository holding my senior project files, so she can stay on top of progress and have a point of reference if she has questions about how assignments are developed or tested.
Grace will be taking over the class after I graduate. She uses a markdown editor for writing, so it would be ideal if I could integrate a WYSIWYG editor with Astro so she could edit assignments more easily, and if I could turn the publishing workflow into a GUI rather than a TUI.
Possible approaches:
During a meeting with Grace, Justin Gillingham popped in. He asked us how we were managing the AWS free tier, and I explained our current approach to cloud assignments and outlined how we were planning resource utilization across the course to make sure students were staying within the free tier. He specifically asked what we were doing for students who had already exhausted their free tier. I told him of the two students who had raised the issue already, and how I told them it would be best if they could get a credit card from a parent who likely hadn’t used the free tier before, and that this first assignment in particular could be completed quickly with only a few cents in charges if they were ok with that.
He then mentioned there is an NSF program called “CloudBank” which provides credits for different cloud platforms to researchers. He suggested we write a proposal and try to get credits through the program, which would grant us an AWS account pre-filled with credits we would then use for instruction. We would have to figure out a way to manage cloud resources so students don’t consume too much.
Justin also let me know there is a graduate section of the course that Douglas Comer is teaching right now, and he offered to put me in touch with the GTA. I’m curious what assignments they have planned for the semester.
Gradescope’s Autograder feature relies on executing instructor-created grader scripts in a Docker container to produce the resulting grade for a student’s cloud assignment submission.
The typical use case is to run and verify a student’s code submission, perhaps by running unit tests in a sandboxed environment. Our usage is different: there’s nothing to sandbox, and we reach out of the environment into the student’s cloud infrastructure. A student submission is not code at all. It’s a permission slip, sharing AWS credentials that grant read-only access to their account.
The container that runs on every student submission may be defined in two ways: as a zip file, whose upload triggers a rebuild of the autograder container for that assignment, or as a URL pointing to an image in a container repository. In the second case, the image is pulled from the container repository every time a student submits their assignment, and it is the option I’ll be exploring in the next few paragraphs.
Any autograder container has to derive from Gradescope’s base image, gradescope/autograder-base. Currently, the base image is built from Ubuntu 22.04 for x86 and has not been updated in 2 years. Its image layers contain test harness logic and an SSH configuration for monitoring its execution. A metadata file and the student submission are mounted into the container at runtime, and the container fetches the latest test harness code from Gradescope’s S3 bucket in us-west-2 before executing the “run_autograder” script provided by an instructor. The default Python installation is version 3.10.12 (released Jun 2023). A final quirk: it uses dumb-init as PID 1 in the container instead of the more typical tini.
Our Elastic Container Registry must be accessible by Gradescope’s autograder. That’s only possible as a public repository, which students could theoretically find. The only other options are to host the images using GitHub or Docker Hub. I have reached out to Gradescope support about this issue; they responded that they have notified the technical team, and I am awaiting a further response.
After some more research, there may be a way to circumvent this issue. Gradescope’s own infrastructure is hosted on AWS, and I’m able to glean their AWS Account ID by the default configuration of autograder containers. I may be able to grant access to the private ECR repository with only that ID. I will experiment with that this week.
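For reference, a cross-account grant on a private ECR repository is expressed as a repository policy whose principal is the other account. The sketch below only builds that policy document as JSON; the function name, the statement ID, and the placeholder account ID are hypothetical, and whether Gradescope’s infrastructure can actually pull with such a grant is exactly what remains to be tested.

```python
import json

# The ECR actions needed to pull (not push) an image.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def cross_account_pull_policy(account_id: str) -> str:
    """Return an ECR repository policy (as JSON text) allowing pulls from one account."""
    if not (len(account_id) == 12 and account_id.isdecimal()):
        raise ValueError("AWS account IDs are 12 digits")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAutograderPull",  # hypothetical statement ID
                "Effect": "Allow",
                # The account root principal delegates access to that whole account.
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": PULL_ACTIONS,
            }
        ],
    }
    return json.dumps(policy, indent=2)


# "123456789012" is a placeholder, not Gradescope's account ID.
print(cross_account_pull_policy("123456789012"))
```

The resulting text could be attached in the ECR console or passed as policyText to boto3’s set_repository_policy. A repository policy alone may not be enough, though: the pulling side still has to authenticate to the registry with its own credentials, which is outside our control.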