Skip to content

Conversation

@henrymercer
Copy link
Contributor

Produce a best effort warning if sensitive environment variables are detected in debug artifacts. This is not reliable enough for production but the intent is that the test will run as part of release validation when the bundle update PR is opened.

Risk assessment

For internal use only. Please select the risk level of this change:

  • Low risk: Changes are fully under feature flags, or have been fully tested and validated in pre-production environments and are highly observable, or are documentation or test only.

Which use cases does this change impact?

Workflow types: n/a
Products: n/a

Environments:

  • Testing/None - This change does not impact any CodeQL workflows in production.

How did/will you validate this change?

  • Unit tests - I am depending on unit test coverage (i.e. tests in .test.ts files).
  • End-to-end tests - I am depending on PR checks (i.e. tests in pr-checks).

If something goes wrong after this change is released, what are the mitigation and rollback strategies?

  • Rollback - Change can only be disabled by rolling back the release or releasing a new version with a fix.

How will you know if something goes wrong after this change is released?

The CI check should fail if the artifact scan starts failing.

Are there any special considerations for merging or releasing this change?

  • No special considerations - This change can be merged at any time.

Merge / deployment checklist

  • Confirm this change is backwards compatible with existing workflows.
  • Consider adding a changelog entry for this change.
  • Confirm the readme and docs have been updated if necessary.

Copilot AI review requested due to automatic review settings December 17, 2025 11:44
@henrymercer henrymercer requested a review from a team as a code owner December 17, 2025 11:44
@github-actions github-actions bot added the size/L May be hard to review label Dec 17, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a best-effort scan of debug artifacts to detect potential GitHub tokens before they are uploaded. The scan runs only in test mode and is designed to catch sensitive data leakage early during release validation, though it is explicitly not reliable enough for production use.

Key Changes:

  • New artifact-scanner.ts module that recursively scans files and archives for GitHub token patterns
  • Integration with debug artifact upload to scan artifacts when in test mode
  • Testing utilities updated to support optional console logging
  • GitHub workflow steps added to verify scan completion
  • Custom action to validate that the scan ran successfully

Reviewed changes

Copilot reviewed 10 out of 14 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/artifact-scanner.ts New module implementing token pattern scanning with recursive archive extraction
src/debug-artifacts.ts Integrates artifact scanner before upload when in test mode
src/testing-utils.ts Adds optional logToConsole parameter to getRecordingLogger
src/artifact-scanner.test.ts Unit tests for the artifact scanner functionality
src/util.ts Adds isInTestMode() helper function
.github/workflows/*.yml Adds verification steps to test workflows
.github/actions/verify-debug-artifact-scan-completed/* Custom action to verify scan completion
lib/*.js Generated JavaScript from TypeScript compilation

Copy link
Contributor

@esbena esbena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have skimmed the code. I don't think it warrants a deep review given its best-effort intentions.

I hoped more of this could have been offloaded to a library.

I worry a bit about the extract/scan/.../delete sequence and limited disk space, but since we are the creators of the asset, we are probably fine in practice. A streaming scan would be preferable.

Hacky solution proposal: pipe all assets through <source> | tar x...... | grep <patterns>. But if we already have something that works, then I can approve without it.


The changes to the generated files seem benign, but they are unexpected. Is that the norm due to non-determinism these days?


// Check if it's an archive file and recursively scan it
const fileName = path.basename(fullPath).toLowerCase();
const isArchive =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you looked at all extensions in our assets in practice?
Haven't we recently played around with something like .xz for instance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find <debug artifacts> -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn
3284 gz
2872 dep
 222 js
 111 log
  52 h
  49 yml
  39 json
   9 qls
   8 zst
   8 zip
   4 trap
   4 set
   4 odep
   3 properties
   3 orig
   3 env
   3 args
   2 yaml
   2 s
   2 rb
   2 py
   2 manifest
   2 list
   2 link
   2 kt
   2 jsonl
   2 java
   2 go
   2 codeql-lock
   2 c

I pushed a commit to extract zstd archives too.

@henrymercer
Copy link
Contributor Author

I hoped more of this could have been offloaded to a library.

For extracting archives, I wanted to use system utilities since this is only going to be run in CI. I'm open to suggestions if you have a library in mind to do the job here, but I think the implementation is fairly slim.

I worry a bit about the extract/scan/.../delete sequence and limited disk space, but since we are the creators of the asset, we are probably fine in practice. A streaming scan would be preferable.

If we start seeing failures, it would probably be a good idea to convert to a streaming scan, or ignore things like the source archive. I think since this is CI-only it doesn't necessarily make sense front-loading that work.

Hacky solution proposal: pipe all assets through <source> | tar x...... | grep <patterns>. But if we already have something that works, then I can approve without it.

There are some features like tracking the file in which the token was found, what type of token it was, and not relying on having bsdtar to extract zips. Once we add these, I think we're at the point where I'd rather maintain a JS program than a shell script :)

The changes to the generated files seem benign, but they are unexpected. Is that the norm due to non-determinism these days?

Unfortunately we are bundling this artifact scanner in order to run it before we upload debug artifacts in CI. It's behind the isInTestMode() check, which requires a specific environment variable to be set, but it is being included in the production artifacts.

@henrymercer henrymercer force-pushed the henrymercer/scan-debug-artifacts branch from c2d74f5 to 1b34478 Compare December 17, 2025 15:21
@henrymercer henrymercer force-pushed the henrymercer/scan-debug-artifacts branch from 1b34478 to ac6c41b Compare December 17, 2025 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L May be hard to review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants