2,384 words · 12 min read
February 16 - February 22
Cloud assignment 2 had an average 18 submissions and 4.5 hours spent per-student . I’m happy to report 76% of students achieved a 100%, and 94% received at least a 50%. 5 hours seems like a reasonable amount of time for an assignment that is due for 10 days, and it was likely exacerbated by container build times for several students.
Submission times are clustered right after class, dropping a bit around dinner time, with the biggest spike between 10pm-midnight. There are very few submissions in the morning, students work on cloud assignments in the afternoon and at night. Over a third of submissions were made on the due date. Only a small percentage of students started the assignment on the last day, but 50% of students started working on the assignment only 3 days before it was due, and a majority of students only achieved a perfect score on the due date.
We now have a lot of data about student engagement for an assignment. When we write the Cloudbank proposal, we can use this data to produce a more accurate cost estimation of the course over the whole semester (they require us to provide a “Period of Performance (PoP) Cloud Spent management” table that projects month-by-month cloud spending for the research endeavor). This is also useful cost and engagement data we can bring to request more funding for the class. We’ll hold off on any specific projections until a couple more assignments are complete.
I did k-means clustering on a group of features and the students were best split into two cohorts. The most distinguishing features of these cohorts were (1) over how many days did they work on the assignment, (2) total number of attempts, (3) total number of sessions, (4) estimated time spent, (5) number of days actively working on an assignment. The logistic regression surfaced similar features, none are predictive though. We’re left with common sense: the more time and effort you put into the assignment, the better score you’ll get. Also, we shouldn’t build predictive models for cloud assignments, we’re not here to predict an outcome, but identify what students are struggling and push them all to the best outcome. Predictive analysis is limited in this domain, time is best spent observing descriptive statistics of student progress and effort.
I’ve overhauled how the instructional site is managed on AWS to anticipate:
How the instructional site was intially set up has several downsides:
I’ve addressed these issues, and provided simpler solutions for the eventualities, by creating a dedicated AWS Organization for the CS351 project. Instructional infrastructure is isolated to a dedicated member account of the organization, and students may also be given their own personal member accounts, with all member accounts governed by Service Control Policies (SCPs), which provide greater guardrails than IAM policies.
Future maintainers can take over governance of the organization, or the member accounts can be transferred to another AWS organization under another institution’s control (Cloudbank, Purdue, RCAC). All resources in operation in the moved account at the time of the move will be unaffected. Student accounts may be created at the beginning of a semester, and deleted at the end, and will be governed according to different SCPs than the instructional infrastructure.
Billing is now isolated to each individual member account, making cost easy to measure. Resource use restrictions are easier to audit, and automation to this effect will be simpler to implement.
The instructional infrastructure permission model is simple to understand, it has full access to the capabilities of an AWS account. Let’s discuss a student account’s capabilities and characteristics.
A student’s AWS Account will be:
Let’s discuss some of the initial guardrails, to paint a picture of what’s possible.
I’ve hinted at ways to limit what services a student has access to, and how to limit their choices within a service, but I haven’t discussed how to limit the quantity of those services they’re able to consume. The practical approach is to have an AWS Budget alert at some threshold value ($20, $50, …), which when triggered causes an AWS Lambda function to apply a deny-all SCP to the account that surpassed the threshold. Instructional staff would also be alerted, and would have to get involved to determine what the student did to exhaust their resource quota, and what that means for their assignment grade.
We are now ready to (1) accurately track course infrastructure cost, (2) handover its management to any responsible party, (3) onboard Cloudbank as a funding source for cloud computing at Purdue, and (4) grant accounts to this semester’s student cohort and accept billing responsibility for future cloud assignments.
In order to organize a move to Purdue-controlled AWS accounts, from the instructor perspective, we would have to request a quota increase to the number of AWS accounts we’re allowed to make in an organization, the default is 10. From the student perspective, in-class we would show them how we use the cloud to help manage the course, and explain we are gathering data for a NSF proposal that will secure funding for the class, so we can do cool projects.
I suggest we bring this organization structure to the graduate section of the course. It may help them set up the infrastructure needed for their cloud project at the end of the semester.
graph TD subgraph Before["Before: Single Account"] A1[Personal AWS Account] A1 --- SP[Senior Project Website] A1 --- II[Instructional Infra] A1 --- SA[Student Resources<br/>shared namespace] end subgraph After["After: AWS Organization"] MGMT[Management Account<br/>org definition + billing] MGMT --> INFRA[Infra Account<br/>EC2, RDS, ECR, CloudFront] MGMT --> S1[Student Account 1] MGMT --> S2[Student Account 2] MGMT --> SN[Student Account N] PERSONAL[Personal Account<br/>senior project website<br/>unchanged] end Before --> After
Gradescope deployed an update on Thursday before the assignment was due that affected the test harness they deploy onto autograder instances. We hadn’t updated our container code for a few days, so the error did not originate from us. Gradescope’s test harness phones-home at each startup to download an updated script. Something in that update was not executing the grading script. I reached out to their help email that day. They were responsive, and the problem was fixed in about 6-8 hours. We were not the only group affected by the issue.
This provokes the question, can we operate the course when Gradescope is down? At the moment, no. However, we have a clear path forward in case we want to re-implement any features, it won’t be very difficult. The primary challenge is onboarding students to hold accounts on the instructional site. SSO through Purdue is likely out of the question, so the practical approach is “magic link” onboarding/login that is restricted to Purdue emails, and only those emails assigned to a class within the instructional site. Once students are onboarded, implementing autograding is easy.
If we want to run a “dual-stack”, that is, let students choose between submitting to Gradescope and the instructional site, that is also easy. The Gradescope container simply polls the instructional site for the autograding result, and when displaying the final score in the user interface, will link to a results page hosted on the instructional site.
We do need more error monitoring, I was only alerted to the issue because students on Ed brought it to my attention. Unfortunately, it’s a chicken-and-egg issue. We can’t log a failure if Gradescope doesn’t even run our code.
I’m syncing all activity in the Ed discussion board to the instructional site. I’ve cloned the discussion board, and created an activity summary dashboard.
I’ve also integrated RCAC’s GenAI service and started experimenting with AI summaries of Ed categories.
I have a couple more experiments in mind.
So far, the instructional site has cost me ~$25. The new infrastructure will cost more, I’ve added
In order to audit these costs transparently, I’ve added an “Infrastructure Costs” section to the instructional site that uses the AWS Cost Explorer API and Cloudwatch “EstimatedBilling” statistic to display costs conveniently on-demand.
I had downtime, due to pushing a feature out without an integration test between the autograder container and the instructional site. They were tested invidually, but not together. Compounded by accidentally rotating the assignment API token, which took me a moment to realize how to replace. The first domino that tumbled the rest was PostgreSQL rejected a raw file upload because it contained null characters, for which I had inadequate logging to realize quickly.
I’ve added rolling application deployments to the instructional site so data isn’t lost during deployments. The autograder containers have been updated to perform retries with exponential backoff when submitting student assignment results in order to accommodate blips in API availability during deploys.
I’ve also added rolling EC2 instance deployments by using the Elastic IP feature from AWS. Once I’ve confirmed a new instance is healthy after an AMI upgrade, or change in type/size, I can point the Elastic IP to it and remove the old instance. This is a two-step instance switchover primarily driven through Terraform and Ansible.
Our infrastructure is now more flexible, robust to both application and system upgrades, but could use more monitoring so we can be aware and pro-active when addressing performance issues and bugs.
Submission page load times grew to 10-20s, now at 500ms-1s. Our submission table was growing quadratically in size due to how much data Gradescope includes in the submission_metadata.json file. Our separated table is still growing quadratically in size, but it’s contained, and that problem is easy to address, the production database is 40MB on-disk out of 20GB available.
Remember, we’re still a t4g.small. It has 2 vCPU with burst, but that drops to .4 baseline vCPU. And 2GB of memory. The database is even smaller, a 1GB 1vCPU db.t3.micro, it accrues CPU credits at half the rate of the t4g.small. Both network and storage bandwidth are throttled. The application is pretty fast considering these limitations, burst scheduling fits it well.
Speedups:
Solutions:
Total speedup: ~20x
I will be making an effort to record in-class questions and address them in Cloud Assignment content.
Confusion about CDN vs data center networking. George drew a good distinction between
I will update Cloud Assignment 2, the final CDN section “Push it to the edge”, with a “<Info>” section that addresses some of these ideas.
Confusion about Docker image names (id vs name vs tag). Will be addressed in Cloud Assignment 3
Lots of student issues to work through this assignment, it was not a breeze. the code was good but not great, and lots of bugs slipped through. There were two primary problems:
Many students were stuck at 80 points, failing the last section, because the straightforward path through the assignment could cause caching issues. This part of the assignment I built when I was tired and working quickly, so its no surprise it is challenging to complete for students. This should have been a better experience for them, it was unnecessarily frustrating. Typically, obstacles in cloud computing are a learning experience, but I think this one went a little too far. It should be updated and refined for next semesters class.
Took hours for many students. I counseled one student to build the image on the EC2 instance and it went much faster.
Part of the issue was I had students use Python 3.14.2, the most recent release, and a lot of packages didn’t have pre-compiled binaries available, so it default to compiling them during the build stage. A better Python version target would have been 3.12.
x86 images had a lot of GPU code. I could have given them better hints for how to compile a slimmer CPU-only image.
Technical Motivation: introduce microservices by taking the strangler fig approach, transition our N-tier web architecture to Kubernetes piece-by-piece. Helps “scale” parts of the service that have different reliability and load requirements.
Social Motivation: We have a big engineering team stepping on each others toes. Introduce Conway’s laws, and microservices, as a solution to this approach. (idea: should students work in teams? Gradescope allows group submissions). Different teams have different KPIs and perform different work (reliability vs agility, front-end vs back-end, etc.). How will you as a DevOps engineer make everyone productive?