
Git Scraper Action

A Forgejo/GitHub Action that scrapes websites and automatically manages changes through pull requests.

Features

  • Recursively download websites with wget
  • Configurable include/exclude patterns for files and directories
  • Automatic change detection for both modified and new files
  • Automated pull request creation with optional auto-approval and merging
  • Support for custom user agent and other wget options
  • Comprehensive test suite included
  • Runs in a containerized environment with minimal dependencies

Usage

Basic Usage

name: Scrape Website

on:
  schedule:
    - cron: '0 0 * * *'  # Run daily
  workflow_dispatch:  # Allow manual triggers

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Git Scraper
        uses: 'https://code.moof.space/moof/git-scraper@main'
        with:
          url: 'https://example.com'
          recursive: 'true'
          no_parent: 'true'
          recurse_limit: '2'
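
The basic inputs above translate more or less directly into wget flags. Below is a minimal sketch of that mapping (the authoritative version lives in scrape-site.sh; the values mirror the workflow example):

```shell
#!/bin/sh
# Sketch: build the wget flags implied by the basic inputs.
# Values mirror the workflow above; the real mapping is in scrape-site.sh.
url='https://example.com'
recursive='true'
no_parent='true'
recurse_limit='2'

flags=""
[ "$recursive" = "true" ] && flags="$flags --recursive"
[ "$no_parent" = "true" ] && flags="$flags --no-parent"
[ -n "$recurse_limit" ] && flags="$flags --level=$recurse_limit"

echo "wget$flags $url"
# → wget --recursive --no-parent --level=2 https://example.com
```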

Advanced Usage

- uses: 'https://code.moof.space/moof/git-scraper@main'
  with:
    url: 'https://example.com'
    recursive: 'true'
    no_parent: 'true'
    recurse_limit: '3'
    include_files: '*.html,*.css,*.js'
    exclude_files: '*.tmp,*.log'
    include_dirs: 'blog,docs'
    exclude_dirs: 'admin,private'
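
wget natively accepts comma-separated lists for its accept/reject options, so the pattern inputs can plausibly be passed through unchanged. A sketch, assuming that pass-through (values from the example above):

```shell
#!/bin/sh
# Sketch: map the pattern inputs onto wget's accept/reject flags.
# wget's -A/-R and -I/-X (--include-directories/--exclude-directories)
# already take comma-separated lists, so no splitting is needed.
include_files='*.html,*.css,*.js'
exclude_files='*.tmp,*.log'
include_dirs='blog,docs'
exclude_dirs='admin,private'

pattern_flags=""
[ -n "$include_files" ] && pattern_flags="$pattern_flags -A $include_files"
[ -n "$exclude_files" ] && pattern_flags="$pattern_flags -R $exclude_files"
[ -n "$include_dirs" ]  && pattern_flags="$pattern_flags -I $include_dirs"
[ -n "$exclude_dirs" ]  && pattern_flags="$pattern_flags -X $exclude_dirs"

echo "$pattern_flags"
```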

Inputs

| Input | Required | Default | Description |
|-------|----------|---------|-------------|
| url | Yes | - | URL of the website to scrape |
| recursive | No | false | Enable recursive download |
| no_parent | No | true | Do not ascend to the parent directory |
| recurse_limit | No | 5 | Maximum recursion depth |
| include_files | No | - | Comma-separated list of file patterns to include |
| exclude_files | No | - | Comma-separated list of file patterns to exclude |
| include_dirs | No | - | Comma-separated list of directory patterns to include |
| exclude_dirs | No | - | Comma-separated list of directory patterns to exclude |
| forgejo_token | No | github.token | Token for PR automation (requires repo scope) |
| auto_approve_pr | No | false | Automatically approve created PRs |
| auto_merge_pr | No | false | Automatically merge created PRs (requires auto_approve_pr: true) |
| git_user_name | No | Forgejo Action Bot | Git author name for commits |
| git_user_email | No | actions@localhost | Git author email for commits |

Change Detection

The action detects the following types of changes:

  • New files downloaded by wget
  • Modified files (content changes)
  • Deleted files
  • File renames
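
One way to classify these change types is from `git status --porcelain` codes (`??` new, ` M` modified, ` D` deleted, `R ` renamed). A sketch using a canned status listing rather than a live repository (the action's actual detection logic may differ):

```shell
#!/bin/sh
# Sketch: count change types from porcelain-style status output.
# '??' = untracked/new, ' M' = modified, ' D' = deleted ('R ' would be a rename).
changes='?? new-page.html
 M index.html
 D old-page.html'

new=$(printf '%s\n' "$changes" | grep -c '^??')
modified=$(printf '%s\n' "$changes" | grep -c '^ M')
deleted=$(printf '%s\n' "$changes" | grep -c '^ D')

echo "new=$new modified=$modified deleted=$deleted"
# → new=1 modified=1 deleted=1
```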

When changes are detected, the action will:

  1. Create a new branch
  2. Commit all changes (including new files)
  3. Push the branch
  4. Create a pull request (if forgejo_token is provided)
  5. Optionally approve and merge the PR (if enabled)
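
Steps 1-3 are plain git plumbing. The sketch below runs them against a throwaway repository so it is self-contained; in the action they target the real checkout, and steps 4-5 additionally require a remote and a token. The branch name and commit message here are illustrative, not the action's actual ones:

```shell
#!/bin/sh
# Sketch of steps 1-3 against a scratch repository (self-contained demo;
# the branch name and commit message are made up for illustration).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name  'Forgejo Action Bot'   # matches the git_user_name default
git config user.email 'actions@localhost'    # matches the git_user_email default

branch="scrape-update-$(date +%Y%m%d)"       # 1. create a new branch
git checkout -q -b "$branch"

echo '<html>scraped</html>' > index.html     # stand-in for wget output
git add -A                                   # 2. stage new and modified files alike
git commit -q -m 'Update scraped content'    # 3. would be followed by git push

echo "committed on $(git rev-parse --abbrev-ref HEAD)"
```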

PR Automation

To enable PR automation, you'll need to:

  1. Create a personal access token with repo scope
  2. Add it as a secret in your repository (e.g., FORGEJO_TOKEN)
  3. Configure the action:
- uses: 'https://code.moof.space/moof/git-scraper@main'
  with:
    url: 'https://example.com'
    forgejo_token: ${{ secrets.FORGEJO_TOKEN }}
    auto_approve_pr: 'true'  # Optional: auto-approve PRs
    auto_merge_pr: 'true'    # Optional: auto-merge approved PRs
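
Forgejo exposes Gitea-compatible `/api/v1` endpoints for pull-request reviews and merges, which is what the approve and merge steps amount to. A dry-run sketch that only prints the requests it would make (host, repository, and PR index are made-up placeholders):

```shell
#!/bin/sh
# Sketch (dry run): the API calls behind auto-approve and auto-merge.
# Nothing is sent; repo and PR index are placeholder values.
api='https://code.moof.space/api/v1'
repo='moof/example-site'   # placeholder owner/repo
index=42                   # placeholder PR index

approve_url="$api/repos/$repo/pulls/$index/reviews"
merge_url="$api/repos/$repo/pulls/$index/merge"

# With a real token, these requests would be sent as:
#   curl -X POST -H "Authorization: token $FORGEJO_TOKEN" \
#        -H 'Content-Type: application/json' -d '{"event":"APPROVE"}' "$approve_url"
#   curl -X POST -H "Authorization: token $FORGEJO_TOKEN" \
#        -H 'Content-Type: application/json' -d '{"Do":"merge"}' "$merge_url"
echo "POST $approve_url"
echo "POST $merge_url"
```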

Development

Testing Locally

  1. Install dependencies:

    apt-get update && apt-get install -y wget jq git
    
  2. Run the test suite:

    chmod +x test-scrape-site.sh
    ./test-scrape-site.sh
    

Testing in Forgejo/GitHub Actions

Create a workflow file (.forgejo/workflows/scrape.yml):

name: Scrape Website

on:
  schedule:
    - cron: '0 0 * * *'  # Run daily
  workflow_dispatch:     # Allow manual triggers

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Git Scraper
        uses: 'https://code.moof.space/moof/git-scraper@main'
        with:
          url: 'https://example.com'
          recursive: 'true'
          forgejo_token: ${{ secrets.FORGEJO_TOKEN }}
          auto_approve_pr: 'true'
          auto_merge_pr: 'true'

Building and Publishing

  1. Commit and push your changes to the main branch
  2. Create a new release/tag for versioning

Troubleshooting

  • No changes detected: Ensure the URL is accessible and contains the expected content
  • Permission denied: Verify your FORGEJO_TOKEN has the correct permissions (repo scope)
  • Large repositories: For large websites, consider increasing the workflow timeout in your repository settings

License

MIT

Author

moof