For a long time I have been playing with the concept of Enablement Automation Code as a Product. The alternative is roughly “as a quick start template”. In this case, the code is not actually a product - but the effort is managed from all the perspectives that would apply if it were one.
In my experience, this seemingly small shift in perspective causes a butterfly effect of positive outcomes in the work product. It can be seen here in the significant work put into “self-service” and “patching built-in”. In this specific case it also brought ML and AI Ops into view because many organizations use scaled CI compute to do model processing runs.
Other things Enablement Automation Code as a Product informed in this case include choosing Boring AWS Technology. In this case, sticking with the “Boring” choice of CloudFormation as the Infrastructure as Code (IaC) language enables thousands of IaC Automation Professionals to quickly implement, innovate and contribute. It may also allow this code to eventually be an AWS QuickStart - a sort of Production-Grade IaC Templates Marketplace provided by AWS.
Studying competing offerings and implementing customer requirements are also hallmarks of product management. At a former employer, I managed a very similar effort specifically for GitLab Runner, where I started to experiment with the Enablement Automation Code as a Product perspective - those experiences have informed this solution.
Enablement Automation Code as a Product strongly relates to a previous blog post on why working examples are also the best learning aid: Back to Basics: Testable Reference Pattern Manifesto (With Testable Sample Code)
Features Checklist
Some features in this list are highlights of especially applicable code inherited from The Ultimate AWS AutoScaling Group ASG Lab Kit, but items marked “NEW” are specifically new compared to that code.
It is important to review the README.md to understand whether this code is appropriate for your use case.
- Self-Service - “Vending Machine” in the name indicates that there is a purposeful production orientation in developing this code to make it deployable by individuals who are not experts in either AWS or GitLab Runner setup. For instance, there is sufficient commentary in the code and in CloudFormation console forms to enable anyone to figure out how to deploy a runner. Finer points of figuring out a smooth autoscaling metric do require more knowledge in these areas - but standing up an HA runner of any of the types supported is fairly straightforward.
- NEW: Runner Management At Scale - Visibility - runner naming and AWS EC2 tagging are used to ensure it is always easy to know where a runner resides in scaled, multi-account AWS implementations. All runner tags are surfaced as a single comma-separated EC2 Tag.
- NEW: Designed for Long Term Runner Management At Scale - Easy Patching and Updates - the “maintainability” built into the original Ultimate ASG AutoScaling Group Lab Kit allows runner ASGs to be easily updated with all of:
  - The latest AMI
  - The latest OS patches
  - NEW: The latest GitLab runner
- NEW: Spot / On-Demand Compute Type Surfaced as Runner Tags - CI/CD Automation developers can choose what type of compute is acceptable for their purposes at the per-job level of granularity.
- NEW: CloudWatch Instance Metrics Collected for Linux and Windows - mainly for memory-based scaling, but disk and network are also collected so that bottlenecks with these resources can be spotted for specialty workloads like ML Ops. They are also dimensioned on “Instance Type” for the same reason - analyzing bottlenecks by instance type in order to select the best one for the workload.
- NEW: Scaling on Memory Utilization - workloads that have low CPU utilization, but still consume memory (for example GitLab jobs that just poll for status from another system), may be better scaled using memory utilization.
- NEW: Runner On/Off Scheduling - one stop schedule and one start schedule (implemented as ASG scheduled actions) allow the runner to be completely shut down when it is not being used. Use cases include: a development-team-specific CI runner that is used only during working hours, CD runners that are only needed during deployment events to specific environments, and runners that are manually shut down when done but need to auto-start at a given time of day or week.
- NEW: Provides Linux “docker” Runner Executor for Primary Use Case of docker+machine Scaling Replacement - the docker+machine GitLab Runner executor is deprecated because Docker Machine is deprecated by Docker. This replaces the docker level scaling with
- NEW: Provides Windows shell Runner Executor for Primary Use Case of .NET Framework and Other Windows Development - some .NET Framework CI builds and other Windows CI builds may require a full Windows instance due to the build tooling requirements. By having a Windows shell runner that can also auto-scale, development teams with these requirements can have a scaling GitLab runner.
- NEW: Provides Linux Shell and Windows Docker GitLab Runner Executors - while less common, these are provided as well.
- NEW: Extensive Troubleshooting Information Documented - [TESTING-TROUBLESHOOTING](https://gitlab.com/guided-explorations/aws/gitlab-runner-autoscaling-aws-asg/-/blob/master/TESTING-TROUBLESHOOTING.md) - also linked from README.md
- Designed for Extensibility and Testability - the runner configuration scripts are (a) separate from the CloudFormation template and (b) downloaded dynamically at scaling time. This enables getting around some code limitations of CloudFormation, but it also enables others to easily create their own runner configurations and store them anywhere. It also enables iterative testing over just the runner script portion by scaling down to 0 and then back up - the dynamic sourcing of the scripts causes updates to be picked up. For stronger version pegging or “full immutable automation code” the script can have a version number embedded in the file name and be sourced from S3.
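To make the tagging features in the checklist above more concrete, here is a minimal Python sketch of the idea: the compute type becomes one of the GitLab runner tags, and the full runner tag list is mirrored into a single comma-separated EC2 tag. This is an illustration only, not the code from the repository. The function names, the `computetype-` prefix and the `GitLabRunnerTags` key are hypothetical; the 256-character limit on an EC2 tag value is an AWS constraint.

```python
from typing import Dict, Iterable, List

EC2_TAG_VALUE_MAX = 256  # AWS limits an EC2 tag value to 256 characters


def build_runner_tags(os_name: str, executor: str, compute_type: str,
                      extra: Iterable[str] = ()) -> List[str]:
    """Return a de-duplicated, lower-cased GitLab runner tag list.

    The "computetype-" prefix is a hypothetical naming convention.
    """
    if compute_type not in ("spot", "ondemand"):
        raise ValueError(f"unknown compute type: {compute_type!r}")
    candidates = [os_name, executor, f"computetype-{compute_type}", *extra]
    tags: List[str] = []
    for raw in candidates:
        tag = raw.strip().lower()
        if tag and tag not in tags:  # keep first occurrence, preserve order
            tags.append(tag)
    return tags


def to_ec2_tag(runner_tags: List[str],
               key: str = "GitLabRunnerTags") -> Dict[str, str]:
    """Mirror the runner tag list into one comma-separated EC2 tag."""
    value = ",".join(runner_tags)
    if len(value) > EC2_TAG_VALUE_MAX:
        raise ValueError("runner tag list does not fit in one EC2 tag value")
    return {"Key": key, "Value": value}


if __name__ == "__main__":
    tags = build_runner_tags("linux", "docker", "spot", extra=["TeamA", "linux"])
    print(to_ec2_tag(tags))
```

With tags built this way, a job could opt in to a compute type through the standard `tags:` keyword in `.gitlab-ci.yml` (for example `computetype-spot` under the hypothetical naming used here), and an operator could find the same runner in the EC2 console by reading the mirrored tag.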
Code for This Post
GitLab HA Scaling Runner Vending Machine for AWS
Mission Impossible Code Series Inclusion
- The solution sticks to the Boring Technology selection criteria.
- The solution is implemented in a single CloudFormation template.
- The solution implements “Least Privileges”.
- The solution implements “Least Configuration” (don’t configure things that the user indicates they won’t use, use blank parameters as an off switch for least config).
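As an illustration of “blank parameters as an off switch”, the following Python sketch applies the idea to the runner on/off scheduling feature: a scheduled action is only produced when its cron expression is non-blank. This is a sketch under assumptions, not the template's actual logic (which lives in CloudFormation). The function name, the action names and the default sizes are hypothetical; the dictionary keys are chosen to mirror the parameters of the Auto Scaling `PutScheduledUpdateGroupAction` API.

```python
from typing import Any, Dict, List


def scheduled_actions(asg_name: str, start_cron: str = "", stop_cron: str = "",
                      min_size: int = 1, max_size: int = 1) -> List[Dict[str, Any]]:
    """Build ASG scheduled actions, skipping any whose cron string is blank."""
    actions: List[Dict[str, Any]] = []
    if start_cron.strip():  # blank means: do not configure a start schedule
        actions.append({
            "AutoScalingGroupName": asg_name,
            "ScheduledActionName": "runner-start",  # hypothetical name
            "Recurrence": start_cron.strip(),
            "MinSize": min_size,
            "MaxSize": max_size,
            "DesiredCapacity": min_size,
        })
    if stop_cron.strip():  # blank means: do not configure a stop schedule
        actions.append({
            "AutoScalingGroupName": asg_name,
            "ScheduledActionName": "runner-stop",  # hypothetical name
            "Recurrence": stop_cron.strip(),
            # All zeros so that nothing can scale the group back up
            "MinSize": 0,
            "MaxSize": 0,
            "DesiredCapacity": 0,
        })
    return actions


if __name__ == "__main__":
    # Start at 07:00 and stop at 19:00, Monday to Friday
    for action in scheduled_actions("runner-asg", "0 7 * * 1-5", "0 19 * * 1-5",
                                    min_size=1, max_size=4):
        print(action)
```

Leaving both cron parameters blank yields an empty list, so a user who does not want scheduling never has to reason about it.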