Building a Best Practice CloudFormation Custom Resource Pattern

I was looking to add the ability for users of a CloudFormation template to be able to specify networking, but without overcomplicating the existing parameter set or the required information gathering. Meeting the requirement ended up being the gateway to learning how to create CloudFormation Custom Resources backed by Lambda. While I was at it, I made sure the code could be reused for future Custom Resource needs. This article shares the simplest possible way I could devise for automatically gathering the right information from the fewest parameters. It also presents a pattern for a well written CloudFormation Custom Function with enhanced exception logging and compact code.

“The Whole Berg - Above and Below the Waterline”: Posts in the “Mission Impossible Code” series contain toolsmithing information that is not necessary to reuse the solution. Use the iceberg glyphs to know when the content is diving below the waterline into “How I Made This”. The content is also designed to be skim-read.

“Tip of the Iceberg - Concise Summary Discussion”: The “Tip of the Iceberg” icon indicates the simplest possible information on why and what, in order to assess and implement.

“Deep Dive - Below The Water Line Discussion”: The “Below The Water Line” icon indicates a deep dive into the nitty gritty details of how to take a similar approach to building solutions.

Pearl Diving To Learn Custom Resources, Lambda and Python

During this effort I mused how we sometimes do the equivalent of Pearl Diving when taking on new skills. It's the idea of deep learning a stack of one or more new things while under the pressure of needing the end state code to reflect a maturity level significantly higher than your beginner expertise in that stack. I have captured the details about Pearl Diving - and when you should use it (because you usually should not) - in a companion post titled “Pearl Diving - Just In Time Learning of Mature Coding Habits For a New Stack”.

The Mission Objectives and Parameters

“Tip of the Iceberg - Concise Summary Discussion”: Mission Objectives and Parameters articulate the final objectives that emerged from both the preplanning and build process. Code Summary gives an outline of the code fragments. Code Call Outs highlights significant constraints, innovations and possible alternatives in the code.

  1. Objective: Ensure that AWS VPC / Networking can be specified, with minimal complication of the existing easiest user experience case
  2. Desirable Constraints In Meeting Objective:
    1. Adding the minimum number of new parameters to give the ability to select a network location for the scaling group.
    2. Retaining simplicity of previous version to automatically use the default VPC by default.
    3. Do not exceed code size limitations that would complicate the solution beyond the current simplicity of a single CloudFormation template (just for the sake of code size). Limit for CloudFormation embedded Lambda functions: 4KB . Limit for CloudFormation template file size when loaded from S3: 460.8 KB (note: the limit when submitting from the command line is 51.2 KB)

Code Summary

  1. Shows minimal CloudFormation (CF) Custom Resource, consisting of 3 CloudFormation Resources
    1. The Custom Function declaration - which contains the Python code (“VPCInfoLambda:” in the below)
    2. The IAM Role declaration - permissions used by the Lambda execution (“CFCustomResourceLambdaRole:” in the below)
    3. The “call” to the Custom Function - the input of parameters and return of data through a Cloud Formation Resource interface (“LookupVPCInfo:” in the below)
  2. A fragment shows how the resource data is retrieved inside the definition of a scaling group. (“InstanceASG:” in the below)
  3. A fragment shows the definition of the parameter (“Parameters:” in the below)

Code Call Outs

Using The Code For Subnet Enumeration

  • Note that the parameter “SpecifyVPCToUse” does not use the type “AWS::EC2::VPC::Id” to get a drop down list of actual VPCs because it would defeat the above Desirable Constraint “Retaining simplicity of previous version to automatically use the default VPC by default.” However, if your organization NEVER uses default VPCs or disables them, changing the parameter type to “AWS::EC2::VPC::Id” actually improves the experience because users do not have to look up VPC ids in the console and the one they select will always exist.
  • Note that the CF functions that retrieve data for “AvailabilityZones:” and “VPCZoneIdentifier:” could add the !Select function to only use a specified number of AZs and Subnets rather than using all available Subnets in the VPC.
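
The index correlation between the two returned lists is what makes a partial selection safe. As a rough illustration only, here is the same idea in Python. The helper name `limit_correlated_lists` and the notion of trimming in code rather than with !Select are assumptions made for this sketch, not part of the published template.

```python
def limit_correlated_lists(subnet_ids_csv, zone_ids_csv, max_count):
    """Keep the first max_count entries of two index-correlated CSV lists.

    Hypothetical helper: the published template returns the full lists and
    leaves any trimming to CloudFormation functions such as !Select.
    """
    subnet_ids = subnet_ids_csv.split(",")
    zone_ids = zone_ids_csv.split(",")
    if len(subnet_ids) != len(zone_ids):
        raise ValueError("subnet and zone lists must be the same length")
    # Slicing both lists with the same bound keeps each subnet paired
    # with the availability zone it lives in.
    return ",".join(subnet_ids[:max_count]), ",".join(zone_ids[:max_count])


# Made-up ids, purely for illustration.
subnets, zones = limit_correlated_lists(
    "subnet-aaa,subnet-bbb,subnet-ccc",
    "us-east-1a,us-east-1b,us-east-1c",
    2,
)
print(subnets)  # subnet-aaa,subnet-bbb
print(zones)    # us-east-1a,us-east-1b
```

Whichever mechanism does the trimming, the point is that the same positions must be taken from both lists so the AZ list handed to the scaling group still matches the chosen subnets.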

Reusing The Code As The Pattern For Other Custom Resources

  • Note that Lambda logging to CloudWatch is configured (including security) and helpful account and region context is output whenever trapped or untrapped exceptions occur.
  • Note the building of the object “responseData” - this is showing how to return multiple values when many examples show a much simpler structure for returning only a single value.
  • Note lines with “raise Exception” reuse the global exception handling for trapped known exceptions to keep code compact by reusing the logging, tracing and verbose error output code containing context cues.
  • Note the lines with “signal.alarm” are timeout handling - which is important in serverless.
  • Note that “CFCustomResourceLambdaRole:” is the least privilege IAM permissions for this function. If you build a function to do other things, the permissions in this YAML will need to be updated to match - but should be kept least privilege.
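
To make the “responseData” call out concrete, the sketch below builds the same four keys the function returns, isolated from AWS so it can be run anywhere. The helper name `build_response_data` and the stand-in subnet objects are assumptions for illustration; in the real function the objects come from boto3 and the dict is handed to `cfnresponse.send`.

```python
from types import SimpleNamespace


def build_response_data(vpc_id, subnets):
    """Return several values in one dict, as the custom resource does.

    `subnets` is any iterable of objects with `id` and `availability_zone`
    attributes (hypothetical stand-ins are used below).
    """
    subnets = list(subnets)
    return {
        "VPC_id": vpc_id,
        # Both lists are built from the same sequence so their indexes
        # stay correlated.
        "OrderedSubnetIdList": ",".join(s.id for s in subnets),
        "OrderedZoneIdList": ",".join(s.availability_zone for s in subnets),
        "SubnetCount": len(subnets),
    }


# Stand-in objects with made-up ids so the sketch runs without AWS access.
fake_subnets = [
    SimpleNamespace(id="subnet-aaa", availability_zone="us-east-1a"),
    SimpleNamespace(id="subnet-bbb", availability_zone="us-east-1b"),
]
data = build_response_data("vpc-123", fake_subnets)
print(data["OrderedSubnetIdList"])  # subnet-aaa,subnet-bbb
```

Each key in the dict becomes an attribute that the template can read with !GetAtt on the custom resource, which is why lists are flattened to comma-separated strings and split again in the template.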

Source Code For This Article

This Code Working In Production

This code was created for the solution GitLab HA Scaling Runner Vending Machine for AWS

The Code Itself

#Arguments: Vpc-id or "DefaultVPC"
#Returns: vpc-id, number of subnets and ordered list of subnetids and az ids.  
# The index of these two return lists are correlated if it is desirable to choose less than the whole list using the CloudFormation function "Select" against both lists.
  VPCInfoLambda:
    Type: 'AWS::Lambda::Function'
    Properties:
      Description: Returns the VPC id, subnet ids and availability zones for a given or default VPC
      MemorySize: 256
      Runtime: python3.8
      Handler: index.handler
      Role: !GetAtt CFCustomResourceLambdaRole.Arn
      Timeout: 240
      Code:
        ZipFile: |
          import logging
          import traceback
          import signal
          import cfnresponse
          import boto3

          LOGGER = logging.getLogger()
          LOGGER.setLevel(logging.INFO)

          def handler(event, context):
              # Setup alarm for remaining runtime minus a second
              signal.alarm((int(context.get_remaining_time_in_millis() / 1000)) - 1)
              try:
                  LOGGER.info('REQUEST RECEIVED:\n %s', event)
                  LOGGER.info('CONTEXT RECEIVED:\n %s', context)
                  if event['RequestType'] == 'Delete':
                      LOGGER.info('DELETE!')
                      cfnresponse.send(event, context, "SUCCESS", {
                           "Message": "Resource deletion successful!"})
                      return
                  elif event['RequestType'] in ('Create', 'Update'):
                      # Update repeats the lookup so the !GetAtt values are still returned
                      LOGGER.info('%s!', event['RequestType'].upper())
                      request_properties = event.get('ResourceProperties', None)

                      VpcToGet = event['ResourceProperties'].get('VpcToGet', '')
                      ec2 = boto3.resource('ec2')
                      VpcCheckedList = []
                      TargetVPC = None
                      vpclist = ec2.vpcs.all()
                      for vpc in vpclist:
                          VpcCheckedList.append(vpc.id)
                          if VpcToGet == "DefaultVPC" and vpc.is_default == True:
                              TargetVPC=vpc
                          elif vpc.vpc_id == VpcToGet:
                              TargetVPC=vpc

                      if TargetVPC is None:
                        raise Exception(f'VPC {VpcToGet} was not found among the ones in this account and region, which are: {", ".join(VpcCheckedList)}')
                      else:
                        VPCOutput = TargetVPC.id
                        subidlist = []
                        zoneidlist = []
                        subnets = list(TargetVPC.subnets.all())
                        for subnet in subnets:
                          subidlist.append(subnet.id)
                          zoneidlist.append(subnet.availability_zone)
                        subidOutput = ",".join(subidlist)
                        zoneidOutput = ",".join(zoneidlist)
                        if not subnets:
                          raise Exception(f'There are no subnets in VPC: {VpcToGet}')
                        LOGGER.info('subnet ids are: %s', subidOutput)
                        LOGGER.info('zone ids are: %s', zoneidOutput)

                      responseData = {}
                      responseData['VPC_id'] = VPCOutput
                      responseData['OrderedSubnetIdList'] = subidOutput
                      responseData['OrderedZoneIdList'] = zoneidOutput
                      responseData['SubnetCount'] = len(subidlist)
                      cfnresponse.send(event, context, cfnresponse.SUCCESS, responseData)                                  

              except Exception as err:
                  AccountRegionInfo=f'Occurred in Account {context.invoked_function_arn.split(":")[4]} in region {context.invoked_function_arn.split(":")[3]}'
                  FinalMsg=str(err) + ' ' + AccountRegionInfo
                  LOGGER.error('ERROR: %s', FinalMsg)
                  # format_exc() returns the traceback text; print_tb() returns None
                  LOGGER.error('TRACEBACK %s', traceback.format_exc())
                  cfnresponse.send(event, context, "FAILED", {
                      "Message": FinalMsg})

          def timeout_handler(_signal, _frame):
              '''Handle SIGALRM'''
              raise Exception('Time exceeded')

          signal.signal(signal.SIGALRM, timeout_handler)


  #Custom Function IAM Role Declaration
  CFCustomResourceLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "lambda.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Policies:
        - PolicyName: "lambda-write-logs"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action:
                  - "logs:CreateLogGroup"
                  - "logs:CreateLogStream"  
                  - "logs:PutLogEvents"
                Resource: "arn:aws:logs:*:*"
        - PolicyName: "describe-vpcs-and-subnets"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action:
                  - "ec2:DescribeVpcs"
                  - "ec2:DescribeSubnets"
                Resource: "*"

  #Calling Function to Retrieve Data
  LookupVPCInfo:
    Type: Custom::VPCInfo
    Properties:
      ServiceToken: !GetAtt VPCInfoLambda.Arn
      VpcToGet: !Ref SpecifyVPCToUse

Fragment That Demonstrates Parameter Collection

#Parameter declaration with important default
Parameters:
  SpecifyVPCToUse:
    Description: >
      DefaultVPC - finds the VPC and configures all of its subnets for you. Otherwise type 
      in the VPC id of a VPC in the same region where you run the template. 
      All subnets and azs of the chosen vpc will be used.
      The VPC and chosen subnets must be setup in a way that allows the runner instances 
      to resolve the DNS name and connect to port 443 on the GitLab instance URL you provide.      
    Default: DefaultVPC
    Type: String
    # While it is tempting to make the above parameter of type "AWS::EC2::VPC::Id" 
    # this prevents automatic discovery and usage of the DefaultVPC.
    # However, if your organization NEVER uses default VPCs or disables them, changing 
    # the type to AWS::EC2::VPC::Id actually improves the user experience because users do not have to 
    # lookup VPC ids in the console.

#Fragment showing using the resultant data from the custom function
  InstanceASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AvailabilityZones: !Split [",",!GetAtt LookupVPCInfo.OrderedZoneIdList]
      VPCZoneIdentifier: !Split [",",!GetAtt LookupVPCInfo.OrderedSubnetIdList]

Solution Architecture Heuristics: Requirements, Constraints, Desirements, Serendipities, Applicability, Limitations and Alternatives

“Deep Dive - Below The Water Line Discussion”: The following content is a deep dive below the waterline into the nitty gritty details of how to take a similar approach to building solutions.

NOTE: You do not need this information to successfully leverage this solution.

The following list demonstrates the Architectural thrust of the solution. This approach is intended to be pure to simplicity of operation and maintenance, rather than purity of a language or framework or development methodology. It is also intended to have the least possible dependencies. The below is a mix of a) previously committed dispositions for the Overall Solution, b) predetermined design points and c) things discovered and adopted during the development process (emergent or organic solution architecture component).

What Does “<==>” Mean?

The notation “<==>”, which may contain logic like “<= AND =>” is my attempt to visually reflect the dynamic tension or trade-offs inherent in using heuristics to commit to fixing positions on a spectrum of possibilities. During the solution formulation these positions fluctuate as you try to simultaneously tune multiple, interacting vectors through trial and error. Even when I do it on purpose, I still can’t completely understand how I am tuning multiple vectors at once and why the results of the process repetitively turn out to effectively solve for multiple vectors. However the internals work, once you’ve produced a sufficiently satisfactory solution, their final positions reflect a complete tuning. They are sort of like custom presets on a sound equalizer. By documenting them as I have done here - I reveal my impression of the final tuning. I feel this does at least three things for the consumer of this information:

  1. You get to see the iceberg below the waterline of something I have built that I hope is “As simple as possible, but not simpler.” So you get to see why I claim that “The Creation of Simplicity is Necessarily a Complex Undertaking.”
  2. You can more easily customize key parts of the solution to your liking, while retaining the value attached to other parts of the tuning.
  3. You can more easily apply this pattern to new problems that may be like it, but not identical.

Solution Architecture Heuristics for the Overall Solution

The overall solution is solving for “Allow Network Configuration Selection, Using the Least New Parameters and Without Complicating the Existing Easiest User Experience Case”.

  • Overall Solution Requirement: (Satisfied) Ensure that VPC / networking can be specified.

    • Benefits: Accommodates advanced AWS implementations, which generally have a design to their VPCs and may even disable the Default VPC. Most CloudFormation templates that are generalized to being a tool must provide this flexibility.
      • Coding Decisions: To actually make the VPC selection enhancement.
  • Overall Solution Requirement: (Satisfied) Add the minimum number of new parameters to give the ability to select a network location for the scaling group.

    • Mission Impossible Heuristic: Make Sophisticated Spy Gadgets <=AND=> Have Simple Controls
      • Benefits: Adoption Driven Development: simple understanding of parameters, simple information collection, easy adoption of new version.
      • Discarded: Innovation Over Defacto Alternatives: It is very common for these types of templates to expose two parameters to take a list of subnet IDs and a list of availability zone names. It is then incumbent on the user to collect these two lists of IDs, ensure they match the desired VPC and ensure that the zones list exactly correlates to the zones of the selected subnets.
      • Coding Decisions: One parameter for the target VPC is used and defaults to the special value “DefaultVPC”. Users who need to use a specific VPC are more likely to know how to look it up or have been provided it by an Infrastructure Automation engineer. Another possibility is that experienced Infrastructure Automation engineers create code over this that forces specific VPCs to be used.
  • Overall Solution Requirement: (Satisfied) Retaining simplicity of previous version to automatically use the default VPC by default.

    • Mission Impossible Heuristic: Make Sophisticated Spy Gadgets <=AND=> Have Simple Controls
      • Benefits: the solution is much simpler to use for beginners or when simply kicking the tires.
      • Coding Decisions: The VPC parameter defaults to “DefaultVPC” - when this value is detected, it automatically locates the default VPC and enumerates its subnets and availability zones. This actually took a lot of design and coding - compared to the previous very simple implementation which used a built-in CloudFormation function to look up AZs.
  • Overall Solution Limitation: Cannot specify subnets / availability zones

    • Reason: the solution only allows selecting a VPC and then it uses all subnets of the VPC - this is for the sake of simplicity.
    • Limitation Rationale: For a first MVC, being able to select VPC gives most of the benefits of enabling a user to use custom VPCs they have prepared according to their practices with a single parameter.
    • Limitation Architecture: Sometimes subnet level selection is provided because a region only has two AZs - this problem is avoided by enumerating existing subnets. Sometimes subnet level selection is provided because certain instance types don’t exist or are constantly exhausted in an AZ - this problem is avoided because the underlying template is capable of selecting multiple instance types, the unavailability of an instance type in an AZ can be managed by providing multiple instance types in the list.
    • Limitation Removal Anticipation: This solution was also engineered specifically to anticipate the removal of this limitation, yet with minimum parameters, by customizers or in a future release. The CloudFormation Custom Resource returns the total subnets found and “order guaranteed” lists of the subnets and availability zones. This would allow adding a single parameter “Number of Availability Zones To Use” - the !Select CF function could then only use that number of randomly selected subnets, yet with matching AZs in the AZ parameter to AutoScaling Groups.
  • Overall Solution Limitation: Cannot use AWS::EC2::VPC::Id to make VPCs list a drop down in UI based template execution.

    • Reason: This parameter type cannot indicate nor default to the Default VPC. This would inordinately complicate the simplest user experience case which must be retained.
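
The “DefaultVPC” coding decision above boils down to one loop over the VPCs in the region. Here is a minimal sketch of that selection logic, pulled out into a pure function so it can be exercised without AWS. The name `select_vpc` and the stand-in VPC objects are assumptions for illustration; the real function iterates over `ec2.vpcs.all()` from boto3.

```python
from types import SimpleNamespace


def select_vpc(vpcs, vpc_to_get):
    """Return (matching VPC or None, list of every VPC id that was checked).

    `vpcs` is any iterable of objects with `id` and `is_default` attributes.
    The special value "DefaultVPC" selects whichever VPC is the default.
    """
    checked_ids = []
    target = None
    for vpc in vpcs:
        checked_ids.append(vpc.id)
        if vpc_to_get == "DefaultVPC":
            if vpc.is_default:
                target = vpc
        elif vpc.id == vpc_to_get:
            target = vpc
    return target, checked_ids


# Stand-in objects with made-up ids so the sketch runs without AWS access.
fake_vpcs = [
    SimpleNamespace(id="vpc-111", is_default=False),
    SimpleNamespace(id="vpc-222", is_default=True),
]
found, checked = select_vpc(fake_vpcs, "DefaultVPC")
print(found.id)   # vpc-222
print(checked)    # ['vpc-111', 'vpc-222']
```

Returning the list of checked ids alongside the result is what lets the error message show the user which VPCs were actually visible in the account and region.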

Solution Architecture Heuristics for the CloudFormation Custom Resource Working Pattern

CF Custom Resource Requirement: (Satisfied) Be compact.

  • Mission Impossible Heuristic: Bring Everything You Depend On <= AND => Pack Light.
  • Reason: Code and file size limitations above
  • Coding Decisions: Reuse of the default exception handling for both unanticipated errors and for calling from trapped errors. Used boto3 “resources” over “clients” as code is less verbose and easier to read.

CF Custom Resource Requirement: (Satisfied) Implement proper exception handling <==> despite size concerns <==> be compact.

  • Mission Impossible Heuristic: If You Are Going To Die, Write a Note About Who Done It and Where.
  • Reasons: a) The call stack is deep and remote, b) debugging cycles are long for IaC
  • Coding Decisions: Find the most concise, functional Python exception handling (Pearl Dive), which is likely NOT the best practice exception handling. Call that same exception handling to report caught exceptions. This was implemented with Python exception handling and the Python traceback module.
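
The sketch below shows the shape of that decision: one `except` block serves both unanticipated errors and known conditions that are deliberately raised. The names `run_with_reporting` and `send_failure` are hypothetical; `send_failure` stands in for the call to `cfnresponse.send` with a "FAILED" status so that the example runs outside Lambda.

```python
import logging
import traceback

LOGGER = logging.getLogger(__name__)


def run_with_reporting(work, send_failure):
    """Run work(); route every failure through one reporting path."""
    try:
        return work()
    except Exception as err:
        LOGGER.error("ERROR: %s", err)
        # format_exc() returns the traceback as a string suitable for logging.
        LOGGER.error("TRACEBACK: %s", traceback.format_exc())
        send_failure(str(err))
        return None


def lookup_with_known_failure():
    # A trapped, anticipated condition is raised as a plain Exception so it
    # reuses the handler above instead of needing its own logging code.
    raise Exception("There are no subnets in VPC: vpc-123")


failures = []
run_with_reporting(lookup_with_known_failure, failures.append)  # trapped case
run_with_reporting(lambda: 1 / 0, failures.append)              # untrapped case
print(failures)
```

The trade-off is that every failure is reported the same way, which costs some precision but keeps the embedded function well inside the size limit.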

CF Custom Resource Requirement: (Satisfied) Have exceptions report maximum context and, where possible, troubleshooting hints.

  • Mission Impossible Heuristic: If You Are Going To Die, Write a Note About Who Done It and Where.
  • Reasons: a) Improve troubleshooting of external problems (like a mismatch between account and/or region and a specified VPC). b) Like myself, other users of this function may not be experts in CF Custom Resources, Lambda or Python - they may be Deep Diving when trying to troubleshoot problems, c) To get more detail in Lambda logs for an error that was blocking progress - multiple blocking error conditions were immediately cleared upon implementing this.
  • Coding Decisions: Extract the AWS Account ID and AWS Region from the Lambda execution context and report it in exceptions. When reporting a VPC Not Found type error, report the VPC list that was enumerated to give evidence that the VPC actually does not exist and a context hint because the enumerated VPCs can be found. Use the logging module.
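
As an illustration of those context cues, the account id and region can be read straight out of the function ARN that Lambda exposes as `context.invoked_function_arn`. The helper names and the sample ARN below are made up for this sketch.

```python
def account_and_region(function_arn):
    """Split a Lambda function ARN into (account_id, region).

    ARN layout: arn:partition:lambda:region:account-id:function:name
    """
    parts = function_arn.split(":")
    return parts[4], parts[3]


def vpc_not_found_message(vpc_to_get, checked_ids, function_arn):
    """Build an error message that carries its own troubleshooting hints."""
    account, region = account_and_region(function_arn)
    return (
        f"VPC {vpc_to_get} was not found among the ones in this account and "
        f"region, which are: {', '.join(checked_ids)}. "
        f"Occurred in Account {account} in region {region}"
    )


# Made-up ARN and ids, purely for illustration.
arn = "arn:aws:lambda:us-east-1:123456789012:function:VPCInfoLambda"
print(account_and_region(arn))  # ('123456789012', 'us-east-1')
print(vpc_not_found_message("vpc-999", ["vpc-111", "vpc-222"], arn))
```

A message like this answers the two most common questions in one line: which VPCs the function could see, and whether it was even looking in the account and region the user expected.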

CF Custom Resource Requirement: (Satisfied) Always build using least privileged security.

  • Mission Impossible Heuristic: Disclose Everything Needed For Success <=AND=> Use a Need to Know Basis
  • Reasons: a) It’s the right thing to do in all coding, but even more so when promoting or reusing patterns - bad sample code is a known attack vector, b) having “least privileged” security built-in fuels adoption and reuse
  • Coding Decisions: Aside from standard CloudWatch logging access for Lambda, this function needed to read VPC data and enumerate subnets - so a new, clearly named, IAM policy was created and contained only “ec2:DescribeVpcs” and “ec2:DescribeSubnets”.

CF Custom Resource Desirement: (Satisfied) Leverage common and available Python modules to simplify the code.

  • Mission Impossible Heuristic: Build The Best Tools <=AND=> Use Available Tools
  • Reasons: Code is simpler to read and reuse and more compact.
  • Coding Decisions: Use of modules: logging, traceback and signal. Use of module “cfnresponse” - many examples had a lot more code in order to manually implement a cloud formation response to the calling stack.

CF Custom Resource Desirement: (Satisfied) Have this function be the basis of a template for future reuse.

  • Mission Impossible Heuristic: Use The Best Implementation <=AND=> Use Available Tools
  • Reasons: I just always try to do this because real world problems are the best source of raw matter for working examples.
  • Coding Decisions: Demonstrate many required capabilities such as: passing parameters in, passing multiple data values out, timeout handling, exception handling, use generic IAM CF block for future reuse.

CF Custom Resource Serendipity: (Satisfied) Support timeout functionality for Lambda serverless.

  • Mission Impossible Heuristic: Use Ad-Hoc Innovation <=AND=> Use Existing Patterns
  • Reasons: Eliminate the source of tough problems by leveraging the wisdom of examples.
  • Coding Decisions: While looking through many coding examples I noticed a few were monitoring for a timeout and implemented it. It turned out that I had situations where I exceeded the timeout which informed tuning the maximum allowed run time. I also found out later from someone experienced in CF custom resources that the timeout is a critical bit of code. Most example code was missing this.
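
The timeout technique can be tried outside Lambda with a fixed deadline. This is a minimal sketch under a few assumptions: it only works on Unix-like systems and in the main thread, the wrapper name `run_with_deadline` is hypothetical, and it raises `TimeoutError` where the article's function raises a plain `Exception`. In the Lambda itself the deadline comes from `context.get_remaining_time_in_millis()` rather than a hard-coded value.

```python
import signal
import time


def timeout_handler(_signal, _frame):
    """Handle SIGALRM by raising, so the normal except path can report it."""
    raise TimeoutError("Time exceeded")


signal.signal(signal.SIGALRM, timeout_handler)


def run_with_deadline(work, seconds):
    """Run work() but raise if it takes longer than `seconds` (Unix only)."""
    signal.alarm(seconds)
    try:
        return work()
    finally:
        signal.alarm(0)  # cancel any pending alarm


try:
    run_with_deadline(lambda: time.sleep(3), 1)
    outcome = "finished"
except TimeoutError as err:
    outcome = str(err)
print(outcome)  # Time exceeded
```

The reason this matters for custom resources is that a Lambda that is killed at its timeout never sends a response, and CloudFormation then waits a long time before giving up. Raising just before the deadline gives the exception handler a chance to send a "FAILED" response instead.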