Attendees
- @dduportal (Damien Duportal)
- @MarkEWaite (Mark Waite)
- @smerle33 (Stéphane Merle)
- @poddingue (Bruno Verachten)
- @kmartens27 (Kevin Martens)
- @hlemeur (Hervé Le Meur)
Announcements
- Weekly release: not today, tomorrow
- Tomorrow: Core release (LTS, Weekly)
- Damien unavailable next meeting
- FOSDEM (Brussels) next week
- Let’s cancel the weekly meeting on the 6th of February. Next: 13th of February
- ci.jenkins.io unavailable (for migration) on Thursday and/or Friday
- After the 2.426.3 core upgrade, as it depends on the security release
Upcoming Calendar
- Next Weekly: tomorrow, 2.442
- Next LTS: tomorrow, 2.426.3
- Next Security Release as per jenkinsci-advisories: tomorrow (https://groups.google.com/g/jenkinsci-advisories/c/QZiecB2ArMs)
- Next major event:
- FOSDEM next week (Brussels)
- SCaLE: 13/14th of March 2024
Notes
Done:
- Lost permission gitlab-branch-source-plugin
- Wadeck’s security browser extension dedicated to plugin maintainers: Wadeck/extension-jenkins-security (cross-browser extension to ease maintainers’ life with advisory preparation)
- Jira license on issues.jenkins.io expires in 30 days
- Thanks LF and Mark!
- Crowdin for next-executions-plugin
- ci.jenkins.io jobs on Windows agents are much slower than 21 days ago
- Main problem (not completely gone): network issues (SNAT exhaustion)
- Azure incidents (also network-related)
- jenkins/jenkins:lts-jdk11 missing arm64 and s390x
- Removed all “faulty” tags to get rid of this issue entirely
- Split docker-jenkins-weekly and docker-jenkins-weekly.ci for infra.ci and weekly.ci
- Different lifecycles for weekly.ci and infra.ci (same core version but different plugin sets)
- Matter of release lifecycle and distinct updates
[get.jenkins.io/mirrors/mirrorbits - Azure] High costs due to usage of Azure File Storage
- Migration to premium storage (still shared storage)
- Let’s create an empty SA (storage account) of kind premium (from updates.jenkins.io.tf) => @smerle (see the az sketch below)
- (related question: could LDAP benefit from the same upgrade?)
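For reference, a minimal az sketch of the premium SA creation; all names are hypothetical, and the real definition would live in updates.jenkins.io.tf as Terraform:

```bash
# Hypothetical names; premium file shares require kind FileStorage + Premium_LRS
az storage account create \
  --name getjenkinsiopremium \
  --resource-group get-jenkins-io \
  --location eastus2 \
  --kind FileStorage \
  --sku Premium_LRS
```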
- Migrate ci.jenkins.io to the sponsored subscription
- WiP: created a new empty VM in the subscription
- Then: copy the data once
- Once the security release is done, we can start the real migration (and run rsync on the diff for data; sketch below)
- Expected gain: ~$500 monthly
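A minimal sketch of the two-pass copy described above (paths and host are hypothetical; assumes the Jenkins home is on local disk and SSH access between the VMs):

```bash
# Pass 1: bulk copy while the old controller keeps running (can take hours)
rsync -az --info=progress2 /var/jenkins_home/ new-ci-vm:/var/jenkins_home/

# Pass 2 (cutover, after the security release): old controller stopped,
# only the diff is transferred, so the downtime window stays short
rsync -az --delete /var/jenkins_home/ new-ci-vm:/var/jenkins_home/
```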
- Azure Kubernetes publick8s suffers from SNAT port exhaustion: network slowness
- Decreased TCP timeout from 30min to 4min
- Adding more ports statically (instead of dynamically) => did not have any effect, and made operations worse
- Added more public IPs, but they are a paid and scarce resource
- WiP: using a NAT gateway in addition (it takes precedence)
- privatek8s is now using NAT gateway
- Don’t forget to allow NAT gateway on the control plane of AKS!
- Todo: publick8s (see the az sketch below)
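A minimal az sketch of the NAT gateway setup (resource names are hypothetical; the 4-minute idle timeout mirrors the TCP timeout change above):

```bash
# Dedicated public IP + NAT gateway for egress: SNAT ports no longer
# come from the load balancer's limited pool
az network public-ip create -g publick8s-rg -n publick8s-natgw-ip --sku Standard
az network nat gateway create -g publick8s-rg -n publick8s-natgw \
  --public-ip-addresses publick8s-natgw-ip --idle-timeout 4

# Attach the gateway to the node subnet: it takes precedence for outbound traffic
az network vnet subnet update -g publick8s-rg \
  --vnet-name publick8s-vnet -n publick8s-nodes --nat-gateway publick8s-natgw

# AKS itself must also be switched to the new outbound path
# (flag availability depends on the az CLI / AKS API version)
az aks update -g publick8s-rg -n publick8s --outbound-type userAssignedNATGateway
```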
- Agents aren’t spawning on infra.ci
- Fixed manually: autoscaling on arm64 fails when the minimum number of nodes is zero; as soon as there is at least 1 node, it works (workaround sketch below)
- Support case opened with AKS. Related to spot instances?
- Migrate to the new subscription?
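The manual fix boils down to raising the autoscaler floor on the arm64 pool (cluster and pool names are hypothetical):

```bash
# Keep at least one arm64 node alive so the autoscaler can scale the pool
az aks nodepool update -g infra-ci-rg --cluster-name infra-ci-cluster \
  --name arm64pool --update-cluster-autoscaler --min-count 1 --max-count 5
```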
- Unexpected delays building small plugins on Linux agents
- Issue related to the DigitalOcean ACP service
- No problems with the Azure and AWS ACP services
- Challenge: we don’t want to consume too much data from JFrog this month, so let’s not clean it up nor disable it
- Given priorities, let’s live with the “slow” DigitalOcean builds for now
- Long term solutions:
- Kube 1.27 upgrade could help, but “later”
- Switch to VM instead of Kube for ACP
- Short term:
- Disable ACP on DO? => would consume more bandwidth
- Disable DO agents temporarily instead => let’s go with this
- infra.ci.jenkins.io on arm64 (controller and agents)
- 2 container images already moved to “all in one” with arm64: docker-helmfile and docker-hashicorp (multi-arch build sketch below)
- Archived these image resources
- Jobs migrated
- Terraform 1.6 is now used (instead of 1.1, which had been in place for months)
- Fewer PRs to review \o/
- WiP: container image docker-builder (aka “webbuilder”)
- Not an easy one: used on both ci.jenkins.io and infra.ci.jenkins.io
- Jobs such as jenkins.io (websites), contributors, etc. depend on it
- Also: infra-reports
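For reference, an “all in one” multi-arch image is a single tag whose manifest covers both architectures. A minimal buildx sketch (the tag is hypothetical; the real builds run in each repository’s CI):

```bash
# One build, one tag, two platforms; the registry serves the right
# architecture automatically (requires a configured buildx builder)
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag example/docker-hashicorp:latest \
  --push .
```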
- Check if we could replace blobxfer by azcopy
- Fact: we cannot revoke the current SAS tokens
- WiP: change operations on storage account file shares to use short-lived SAS tokens (see the sketch below)
- Adding a “policy” to the SA to allow large-scale revocation
- Need to generate 1 “Service Principal” per controller to allow them to write to the file shares (websites, mirrors, etc.)
- Then the az command line will use the Service Principal + policy to generate 1-hour-lived SAS tokens to authenticate to the SAs
- Target file share for contributors.jenkins.io and javadoc first, then the others
- Puppet: install azcopy on the pkg VM
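A sketch of the intended flow (account and share names are hypothetical; assumes AZURE_STORAGE_KEY is available after the Service Principal login). The key property: deleting or updating the stored access policy invalidates every SAS issued against it, which is the large-scale revocation we lack today:

```bash
# Stored access policy on the file share: permissions live here, not in the SAS
az storage share policy create \
  --account-name jenkinswebsites --share-name contributors \
  --name short-lived-writer --permissions rwl

# Generate a 1-hour SAS bound to that policy
expiry=$(date -u -d '+1 hour' '+%Y-%m-%dT%H:%MZ')
sas=$(az storage share generate-sas \
  --account-name jenkinswebsites --name contributors \
  --policy-name short-lived-writer --expiry "$expiry" -o tsv)

# azcopy (instead of blobxfer) uses the SAS to push the content
azcopy sync ./build \
  "https://jenkinswebsites.file.core.windows.net/contributors?${sas}"
```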
- [INFRA-3100] Migrate updates.jenkins.io to another Cloud
- Good candidate for the “azcopy” short-lived SAS approach above
- One corrupted record deleted from the table
- WiP: Searching for another corrupted record
- [Jenkins Agents] Clean up deprecated JNLP arguments
- Delay for later (after FOSDEM)
- Revoke an OpenVPN cert for NotMyFault
- Delay for later (after FOSDEM)
- Host versioned jenkins.io docs on docs.jenkins.io
- No progress last week (other priorities)
- Given our GSoC candidate is in exams right now and Kevin is not blocked immediately, let’s delay until after FOSDEM
- Intermittent out-of-memory failures for Java 21 builds of Jenkins core on ci.jenkins.io
- No progress last week (other priorities)
- Delay for later (after FOSDEM)
- Migration leftovers from publick8s to arm64
- LDAP:
- Image update (was an old one) thanks to Docker Bake
- arm64 requires running in a different availability zone: current storage is in a different zone than the arm64 nodes (zone check sketch below)
- Delay for later (after FOSDEM)
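To see the mismatch (real Kubernetes labels; the PV name is a placeholder): Azure disks are zonal, so a pod can only follow its volume if a node exists in the same zone:

```bash
# Which zone each arm64 node lives in
kubectl get nodes -l kubernetes.io/arch=arm64 -L topology.kubernetes.io/zone

# Which zone the LDAP volume is pinned to (shown under "Node Affinity")
kubectl describe pv <ldap-pv-name> | grep -A 3 'Node Affinity'
```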
- Export download mirrors list to a textual representation
- Goal is now clearer and scoped
- No progress last week (other priorities)
- (new) infra.ci GitHub API rate limiting
- We could use different GitHub Apps to spread the API rate limits and avoid being blocked from delivering from infra.ci (quota check sketch below)
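Each GitHub App installation gets its own API quota, which is why splitting across apps helps. The current quota can be checked against the real /rate_limit endpoint:

```bash
# Remaining API quota for whichever app/token infra.ci is currently using
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" \
  https://api.github.com/rate_limit | jq '.resources.core'
```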
- Docker Hub:
- Last 3 releases of the Docker inbound agent images failed due to “HTTP/429 Rate Limit” on Docker Hub. Was an issue on Docker Hub’s side
- Looks like we’ve been removed and added back (confirmed by them): our systems were removed from the Open Source program (hence the API rate limit), but it is fixed
- Thanks Docker for continuing to sponsor us!
- And new features are available to us: Docker Scout!
ToDo (next milestone): infra-team-sync-2024-01-30 Milestone · GitHub