Codify GitHub Access

@timja and I have been chatting in #jenkins-hosting for a while now about finding a way to get permissions for repos managed by code instead of issuing commands to bots.

I did a quick bit of prototyping last weekend and found out that the GitHub GraphQL API makes it very easy to pull team membership down (getAllTeamsAndMembers.graphql · GitHub), so we can get the existing data, which makes it easy to see who we’d need to invite.
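To make the "who we’d need to invite" part concrete: once the current membership is pulled down, it is basically a set difference. A minimal sketch, assuming both sides are plain lists of GitHub logins; the function name `logins_to_invite` is made up for illustration and is not part of the prototype.

```python
def logins_to_invite(desired, current):
    """Return logins listed in the permissions data but not yet on the team.

    GitHub logins are case-insensitive, so the comparison is done on
    lowercased values; the spelling from `desired` is kept in the result.
    """
    current_lower = {login.lower() for login in current}
    seen = set()
    missing = []
    for login in desired:
        key = login.lower()
        if key not in current_lower and key not in seen:
            seen.add(key)
            missing.append(login)
    return missing


if __name__ == "__main__":
    # Hypothetical data: what the YAML asks for vs. what GitHub reports
    print(logins_to_invite(["halkeye", "anpieber", "pulse00"], ["HalkEye"]))
```

Keeping the input order makes the output stable between runs, which helps if the result ends up in a report or a diff.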

The problem comes down, as always, to a naming issue. What format do we want to store this data in? Right now the best place seems to be jenkins-infra/repository-permissions-updater (the Artifactory permissions synchronization tool and data set), alongside the rest of the permissions for plugins/core.

Tim’s suggestion (which I really really like) is

name: "acceptance-test-harness"
github: "jenkinsci/acceptance-test-harness"
paths:
- "org/jenkins-ci/acceptance-test-harness"
maintainers: # intentionally renamed from developers to possibly make it easier to adapt between old and new format, may not be needed
- jenkins_id: "jglick"
  github: "jglick"
- jenkins_id: "olivergondza"
  github: "ogondza"
- group: cloudbees-developers # or team maybe

----

name: "cloudbees-developers"
maintainers:
- jenkins_id: "teilo"
  github: "jtnord"
# ....

This has the advantage of creating a mapping between Jenkins LDAP ids and GitHub accounts, maybe even something we can use or map in keycloak/beta.accounts.jenkins.io.
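For what it’s worth, expanding the `group:` entries into concrete people looks straightforward. A rough sketch, assuming the files are already parsed into dicts shaped like the YAML above; `resolve_maintainers` and the `groups` lookup are hypothetical names, and nested groups are deliberately not handled.

```python
def resolve_maintainers(entry, groups):
    """Expand `group:` references into concrete maintainer records.

    `entry` is one parsed repository definition; `groups` maps a group
    name to its parsed definition. Nested groups are not supported here.
    """
    resolved = []
    seen = set()
    for item in entry.get("maintainers", []):
        if "group" in item:
            group = groups.get(item["group"])
            if group is None:
                raise KeyError(f"unknown group: {item['group']}")
            members = group.get("maintainers", [])
        else:
            members = [item]
        for member in members:
            key = member["jenkins_id"]
            if key not in seen:  # someone may be listed directly and via a group
                seen.add(key)
                resolved.append(
                    {"jenkins_id": key, "github": member.get("github")}
                )
    return resolved
```

Failing loudly on an unknown group seems safer than silently granting nobody access, but that is a design choice for the real tool to make.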

The downside is currently how to populate that. I think the mapping would have to be done by hand. I think for now I can get away with having one row for jenkins_id, and one for GitHub, and not merge them yet.

developers:
- jenkins_id: "jglick"
  release: false # don't give them publish permissions, just commit permissions. 
  # Not really needed when things are split up.
- github: "jglick"
- jenkins_id: "olivergondza"
- github: "ogondza"
- team: cloudbees-developers
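A sketch of how a tool might read that flat list, under my own assumption that `jenkins_id` rows mean publish permission (unless `release: false`), `github` rows mean commit permission, and `team` rows are passed through for later expansion. `split_permissions` is a hypothetical name.

```python
def split_permissions(developers):
    """Split the flat `developers` list into releasers, committers and teams."""
    releasers, committers, teams = [], [], []
    for item in developers:
        if "jenkins_id" in item:
            # `release: false` opts an account out of publish permissions
            if item.get("release", True):
                releasers.append(item["jenkins_id"])
        elif "github" in item:
            committers.append(item["github"])
        elif "team" in item:
            teams.append(item["team"])
        else:
            raise ValueError(f"unrecognised entry: {item!r}")
    return releasers, committers, teams
```

With the example list above this would give `olivergondza` as the only releaser, `jglick` and `ogondza` as committers, and one team left to expand.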

So while I prototype it a bit, I figured I would ask others if they had any ideas for layouts or other feedback.


Also, as a side note, neither snakeyaml nor groovy yaml is round-trip, so I can’t keep

issues:
- jira: '21481' # blueocean-plugin

intact if I write the migration tool in anything other than Python. I can’t decide if I like

issues:
- jira: 21481
  jiraComponentName: 'blueocean-plugin'

where nothing actually reads the component name and the field only keeps the data around, or whether I’d rather just delete it.

Edit: Looks like the # componentname convention was only for the initial import, so it’s probably not a big deal if it gets deleted. Just depends on how I want to populate existing data.
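If the comment does get promoted to a field, it could be done as a plain text rewrite so that no YAML library has to round-trip anything. A minimal sketch, assuming the comment always sits on the same line as the `jira:` entry as in the example above; `promote_component_comment` is a hypothetical helper and works on single lines without trailing newlines.

```python
import re

# Matches lines such as:  - jira: '21481' # blueocean-plugin
JIRA_LINE = re.compile(
    r"^(?P<indent>\s*)- jira: '?(?P<id>\d+)'?\s*#\s*(?P<name>\S.*?)\s*$"
)


def promote_component_comment(line):
    """Rewrite one line; lines that do not match are returned unchanged."""
    match = JIRA_LINE.match(line)
    if match is None:
        return line
    indent = match.group("indent")
    return (
        f"{indent}- jira: {match.group('id')}\n"
        f"{indent}  jiraComponentName: '{match.group('name')}'"
    )


if __name__ == "__main__":
    source = "issues:\n- jira: '21481' # blueocean-plugin"
    print("\n".join(promote_component_comment(l) for l in source.splitlines()))
```

Lines without a trailing comment are left alone, so running it twice over the same file should be harmless.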

There are two major problems here:

  • The RPU data structure is trash. It’s (justifiably, for historical reasons) oriented at Maven artifacts, but we keep adding stuff that’s more applicable to GitHub repos. Multi-module permissions are already error-prone, and JEP-229 cannot handle multi-module projects at all, IIUC. This needs to be overhauled to support more repo-focused content, like this proposal.
  • GitHub permission management is far from trivial, and the way we’ve set them up (and let maintainers change them) is a giant mess: Team names do not always match repo names. Teams grant access to additional (or just different) repos than their name indicates. Tons of maintainers use “external collaborators” to grant access.
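To get a feel for how big the second problem is, one could flag every team that does not map to exactly one same-named repo. A rough sketch only: it assumes the team-to-repos mapping has already been fetched into a dict (the prototype query does not collect repos), and the `-developers` suffix convention and the name `mismatched_teams` are assumptions for illustration.

```python
def mismatched_teams(team_repos, suffix="-developers"):
    """Return {team_slug: repos} for teams that do not grant access to
    exactly one repo named like the team minus `suffix`."""
    problems = {}
    for slug, repos in team_repos.items():
        expected = slug[: -len(suffix)] if slug.endswith(suffix) else None
        if expected is None or list(repos) != [expected]:
            problems[slug] = sorted(repos)
    return problems


if __name__ == "__main__":
    # Hypothetical input; only the first team follows the convention
    print(mismatched_teams({
        "digitalocean-plugin-developers": ["digitalocean-plugin"],
        "foo-developers": ["foo-plugin", "bar-plugin"],
        "core": ["jenkins"],
    }))
```

This does nothing about external collaborators, which would need their own report.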

This won’t get external collaborators, of which we have 500 (or ~20% of contributors). I’ve struggled with the shitty GraphQL API for a long time to get something that’s mostly working, but that’s only useful for reporting, not for assignment, and it takes forever to scan the entire org with about 4.6M results.

So my quick import script runs in 9 seconds, but I can only see the publicly visible teams. I’m hoping @timja or another GitHub org admin can run it and see what it says for the entire org. I’m leaning towards just doing teams, because teams don’t allow outside collaborators and it’s easier to get data about them (org admins show up as a contributor on every repo, but not on every team).

name: "digitalocean-plugin"
github: "jenkinsci/digitalocean-plugin"
issues:
- jira: '18831' # digitalocean-plugin
paths:
- "com/dubture/jenkins/digitalocean-plugin"
developers:
- "halkeye"
- "anpieber"
- "pulse00"
githubteam:
  visible: true
  members:
  - halkeye
  - anpieber
  - pulse00


requirements.txt

ruamel.yaml
requests

import-committers.py

import os
import requests
import json
from ruamel.yaml import YAML

repositories = {}
yaml = YAML(typ='rt')   # default, if not specified, is 'rt' (round-trip)
yaml.preserve_quotes = True
for entry in os.scandir("permissions/"):
    if not entry.is_file():
        continue

    if entry.path.endswith(".yml") or entry.path.endswith(".yaml"):
        with open(entry.path) as stream:

            permission = yaml.load(stream)
            if "github" in permission:
                repositories[permission['github']] = {
                    'filename':  entry.path,
                    'yaml': permission
                }


# Provide a GraphQL query
query = """
query getTeamsAndMembers($after: String) {
  organization(login: "jenkinsci") {
    teams(first: 100, after: $after, query: "Developers") {
      edges {
        node {
          name
          combinedSlug
          privacy
          invitations(first: 100) {
            nodes {
              invitee {
                login
              }
            }
          }
          members(first: 100) {
            edges {
              node {
                login
              }
            }
          }
        }
      }
      pageInfo {
        endCursor
        hasNextPage
      }
    }
  }
}
"""

after = None
while True:
    # Fetch one page of teams from the GraphQL API
    r = requests.post(
        'https://api.github.com/graphql',
        json={'query': query, 'variables': {'after': after}},
        headers={'Authorization': f"Bearer {os.environ['GITHUB_PASSWORD']}"}
    )
    r.raise_for_status()
    teams = r.json()['data']['organization']['teams']

    for edge in teams['edges']:
        # "jenkinsci/foo-developers" -> "jenkinsci/foo", the `github` key in the YAML
        repo = edge['node']['combinedSlug'].replace('-developers', '')
        if repo not in repositories:
            print("Skipping", repo)
            continue

        filename = repositories[repo]['filename']
        parsed = repositories[repo]['yaml']

        if 'githubteam' not in parsed:
            parsed['githubteam'] = {}

        parsed['githubteam']['visible'] = edge['node']['privacy'] == 'VISIBLE'
        parsed['githubteam']['members'] = []
        # The query asks for invitations as `nodes`, not `edges`
        for invitation in edge['node']['invitations']['nodes']:
            # invitee may be null, e.g. for an invitation sent to an email address
            if invitation['invitee'] is not None:
                parsed['githubteam']['members'].append(
                    invitation['invitee']['login'])
        for memberEdge in edge['node']['members']['edges']:
            parsed['githubteam']['members'].append(
                memberEdge['node']['login'])

        with open(filename, 'w', encoding='utf8') as outfile:
            yaml.dump(parsed, outfile)

    # Check for more pages only after processing this one,
    # otherwise the last page of teams is silently skipped
    if not teams['pageInfo']['hasNextPage']:
        break
    after = teams['pageInfo']['endCursor']

I ran this for @halkeye.

It takes about 1 min to run across the org getting just the team data.
(I commented out the invite section from above as it was breaking the script.)

JEP-229 can handle multi-module projects, it is just ugly: Enable CD on `plugin-compat-tester` & clean up obsolete modules by jglick · Pull Request #2103 · jenkins-infra/repository-permissions-updater · GitHub

awesome, thanks both of you.

The next step for me is to figure out a way to validate any new github logins mentioned, without looking up every user in every team every run, which I think is going to be expensive.
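One cheap way to avoid the repeated lookups would be a small cache of logins that have already been confirmed, so only new names cost an API call. A minimal sketch, assuming a JSON cache file and a caller-supplied `lookup` callable that returns True when the account exists; both are hypothetical, and how `lookup` talks to GitHub is left open.

```python
import json
from pathlib import Path


def validate_logins(logins, cache_path, lookup):
    """Return the logins that `lookup` reports as non-existent.

    Logins already recorded in the cache file are trusted and skipped.
    """
    cache_file = Path(cache_path)
    known = set(json.loads(cache_file.read_text())) if cache_file.exists() else set()
    invalid = []
    # Only logins not seen on a previous run are looked up
    for login in sorted({name.lower() for name in logins} - known):
        if lookup(login):
            known.add(login)
        else:
            invalid.append(login)
    cache_file.write_text(json.dumps(sorted(known)))
    return invalid
```

The obvious catch is that accounts can be renamed or deleted after they are cached, so the cache would need to be thrown away and rebuilt now and then.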

But I think this is totally doable.

I wholeheartedly support this work, thanks for pushing this Gavin.

I think we should have a separate location to store the mapping github <=> Jenkins. In the current proposal IIUC we’d put both in every single repo’s declarations, which seems like a gigantic duplication (and a recipe for mistakes)?
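To illustrate the kind of mistake duplication invites: if the pair is repeated per repo, the same Jenkins id can end up pointing at different GitHub logins in different files. A small sketch of a check for that, assuming definitions parsed into dicts with a `maintainers` list as in the proposal; `conflicting_mappings` is a made-up name.

```python
from collections import defaultdict


def conflicting_mappings(repo_definitions):
    """Return {jenkins_id: sorted github logins} for every Jenkins id that
    is mapped to more than one GitHub login across the definitions."""
    seen = defaultdict(set)
    for definition in repo_definitions:
        for item in definition.get("maintainers", []):
            if "jenkins_id" in item and "github" in item:
                seen[item["jenkins_id"]].add(item["github"].lower())
    return {jid: sorted(logins) for jid, logins in seen.items() if len(logins) > 1}
```

With a single mapping file this check becomes unnecessary, which is more or less the argument for the separate location.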

While I think the mapping is useful, I don’t think we can assume a 1:1 mapping of committers and releasers.

But there’s no reason the same team logic can’t be applied for sharing users between plugins.

My vote is keep it simple. Get it to work, then refactor.

Edit:

For example, anything with cd: true might want no releasers but a bot user.

I think if beta.accounts.jenkins.io ever takes over (or even earlier), it could handle the GitHub mapping via OAuth. Then we could use that and just have two lists of Jenkins ids.

Ideally this would get an entire design with objectives and explanations etc. (nothing huge, just something like my old Gist on issues metadata). It is really difficult for me to see what problems you’re addressing, how this will interact with existing stuff like the bot, what to do about migration of current data, and how you plan to address certain edge cases (collaborators!).