Infrastructure Team Meeting - Feb. 15, 2022

Participants

Damien Duportal (@dduportal), Hervé Le Meur (@hlemeur), Stephane Merle (@smerle), Mark Waite (@MarkEWaite), Tim Jacomb (@timja)

Official minutes on GitHub.

Announcement :loudspeaker:

  1. Core Security release last week: Jenkins 2.334 + Jenkins 2.319.3
  2. Plugins Security release today (in progress)
  3. Weekly 2.335 delayed until tomorrow (16th of February)
    • Changes Linux installer from System V init to systemd
    • Mark wants to write a blog post about the change

Notes :book:

  • ci.jenkins.io outage (Linux container agents failing to allocate) last Friday 10th of Feb.

    • Maven agents not allocating on ci.jenkins.io | Jenkins Status Page
    • Root cause:
      • AWS required an additional IAM permission to allow autoscaling of the EKS cluster nodes. There was no notification and no documentation of the change; we are not sure how it happened.
      • Symptom: the autoscaler component in EKS was logging “Go panic” traces with “Unauthorized 403” errors from the AWS API
    • Multiple fixes:
      • EKS Terraform module upgraded (with a LOT of breaking changes) hoping that it would help. Spoiler: it did not, but the upgrade itself was a success.
        • Faster autoscaling, fewer resources to pay for (no public IP, etc.)
        • Major version change of the plugin, disruptive shift
      • We found the missing permission and fixed it (thanks @smerle @lemeurherve !!!)
    • TODO:
      • Write a Post Mortem
      • PR to the Terraform EKS module examples/doc to add the new permission
      • Docker Hub credential (see below)
      • Contact AWS support to improve their autoscaler documentation
        • The AWS documentation lists fewer permissions than the Terraform documentation
        • The Terraform documentation lists fewer permissions than are actually required
  • Docker Hub API rate limit (again)

  • DigitalOcean

  • Updatecli for Terraform

    • @smerle added more tracking with updatecli for Terraform components
    • infra.ci and release.ci have updated virtual machine templates
      • Using same environment on infra.ci and release.ci
      • Moving towards the retirement of trusted.ci
    • Running Packer image builds
      • Still need to deploy images to a remote registry
        • Moving toward a single image for multiple uses (Docker, VM)
  • ci.jenkins.io agent upgrade

  • Building Docker Images on infra.ci/release.ci

    • release.ci and infra.ci have the full VM Docker capability
    • TODO: update the Docker build shared library to use it instead of img
  • Security of infra.ci

    • Lots of jobs and credentials: need to start “separating” things
      • Deploy web site previews
      • Terraform
      • Puppet
    • TODO: split jobs into folders and move credentials to the matching folders
      • Scope credentials by folder
      • Find the Job DSL directive to load the credentials
      • Should we move the Netlify preview deployment credential to ci.j.io?
      • Could split the Jenkins instance
      • Could use a remote vault system
        • Still accessible by API call to the vault from any job
        • More complicated than we’re ready to do now
    • TODO: do not use shared library changelog as in ci.j
    • TODO: disable GitHub Checks feedback for sensitive jobs (such as Kubernetes or Terraform)
  • Alibaba mirror

    • TODO: add the mirror (it is not yet in the list of mirrors)
  • ci.jenkins.io as code

  • census.jenkins.io: Damien still has to ask Tyler/Olivier what it does