
Git Scraper Action

A Forgejo/GitHub Action that scrapes websites and automatically manages changes through pull requests.

Features

  • Recursively download websites with wget
  • Configurable include/exclude patterns for files and directories
  • Automatic change detection for both modified and new files
  • Automated pull request creation with optional auto-approval and merging
  • Support for custom user agent and other wget options
  • Comprehensive test suite included
  • Runs in a containerized environment with minimal dependencies

Usage

Basic Usage

name: Scrape Website

on:
  schedule:
    - cron: '0 0 * * *'  # Run daily
  workflow_dispatch:  # Allow manual triggers

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Git Scraper
        uses: 'https://code.moof.space/moof/git-scraper@main'
        with:
          url: 'https://example.com'
          recursive: 'true'
          no_parent: 'true'
          recurse_limit: '2'
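
The basic inputs above translate more or less directly into wget flags. Below is a minimal sketch of that mapping (the authoritative version lives in scrape-site.sh; the values mirror the workflow example):

```shell
#!/bin/sh
# Sketch: build the wget flags implied by the basic inputs.
# Values mirror the workflow above; the real mapping is in scrape-site.sh.
url='https://example.com'
recursive='true'
no_parent='true'
recurse_limit='2'

flags=""
[ "$recursive" = "true" ] && flags="$flags --recursive"
[ "$no_parent" = "true" ] && flags="$flags --no-parent"
[ -n "$recurse_limit" ] && flags="$flags --level=$recurse_limit"

echo "wget$flags $url"
# → wget --recursive --no-parent --level=2 https://example.com
```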

Advanced Usage

- uses: 'https://code.moof.space/moof/git-scraper@main'
  with:
    url: 'https://example.com'
    recursive: 'true'
    no_parent: 'true'
    recurse_limit: '3'
    include_files: '*.html,*.css,*.js'
    exclude_files: '*.tmp,*.log'
    include_dirs: 'blog,docs'
    exclude_dirs: 'admin,private'
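
wget natively accepts comma-separated lists for its accept/reject options, so the pattern inputs can plausibly be passed through unchanged. A sketch, assuming that pass-through (values from the example above):

```shell
#!/bin/sh
# Sketch: map the pattern inputs onto wget's accept/reject flags.
# wget's -A/-R and -I/-X (--include-directories/--exclude-directories)
# already take comma-separated lists, so no splitting is needed.
include_files='*.html,*.css,*.js'
exclude_files='*.tmp,*.log'
include_dirs='blog,docs'
exclude_dirs='admin,private'

pattern_flags=""
[ -n "$include_files" ] && pattern_flags="$pattern_flags -A $include_files"
[ -n "$exclude_files" ] && pattern_flags="$pattern_flags -R $exclude_files"
[ -n "$include_dirs" ]  && pattern_flags="$pattern_flags -I $include_dirs"
[ -n "$exclude_dirs" ]  && pattern_flags="$pattern_flags -X $exclude_dirs"

echo "$pattern_flags"
```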

Inputs

| Input | Required | Default | Description |
|-------|----------|---------|-------------|
| url | Yes | - | URL of the website to scrape |
| recursive | No | false | Enable recursive download |
| no_parent | No | true | Do not ascend to the parent directory |
| recurse_limit | No | 5 | Maximum recursion depth |
| include_files | No | - | Comma-separated list of file patterns to include |
| exclude_files | No | - | Comma-separated list of file patterns to exclude |
| include_dirs | No | - | Comma-separated list of directory patterns to include |
| exclude_dirs | No | - | Comma-separated list of directory patterns to exclude |
| forgejo_token | No | github.token | Token for PR automation (requires repo scope) |
| auto_approve_pr | No | false | Automatically approve created PRs |
| auto_merge_pr | No | false | Automatically merge created PRs (requires auto_approve_pr: true) |
| git_user_name | No | Forgejo Action Bot | Git author name for commits |
| git_user_email | No | actions@localhost | Git author email for commits |

Change Detection

The action detects the following types of changes:

  • New files downloaded by wget
  • Modified files (content changes)
  • Deleted files
  • File renames
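
One way to classify these change types is from `git status --porcelain` codes (`??` new, ` M` modified, ` D` deleted, `R ` renamed). A sketch using a canned status listing rather than a live repository (the action's actual detection logic may differ):

```shell
#!/bin/sh
# Sketch: count change types from porcelain-style status output.
# '??' = untracked/new, ' M' = modified, ' D' = deleted ('R ' would be a rename).
changes='?? new-page.html
 M index.html
 D old-page.html'

new=$(printf '%s\n' "$changes" | grep -c '^??')
modified=$(printf '%s\n' "$changes" | grep -c '^ M')
deleted=$(printf '%s\n' "$changes" | grep -c '^ D')

echo "new=$new modified=$modified deleted=$deleted"
# → new=1 modified=1 deleted=1
```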

When changes are detected, the action will:

  1. Create a new branch
  2. Commit all changes (including new files)
  3. Push the branch
  4. Create a pull request (if forgejo_token is provided)
  5. Optionally approve and merge the PR (if enabled)
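
Steps 1-3 are plain git plumbing. The sketch below runs them against a throwaway repository so it is self-contained; in the action they target the real checkout, and steps 4-5 additionally require a remote and a token. The branch name and commit message here are illustrative, not the action's actual ones:

```shell
#!/bin/sh
# Sketch of steps 1-3 against a scratch repository (self-contained demo;
# the branch name and commit message are made up for illustration).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name  'Forgejo Action Bot'   # matches the git_user_name default
git config user.email 'actions@localhost'    # matches the git_user_email default

branch="scrape-update-$(date +%Y%m%d)"       # 1. create a new branch
git checkout -q -b "$branch"

echo '<html>scraped</html>' > index.html     # stand-in for wget output
git add -A                                   # 2. stage new and modified files alike
git commit -q -m 'Update scraped content'    # 3. would be followed by git push

echo "committed on $(git rev-parse --abbrev-ref HEAD)"
```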

PR Automation

To enable PR automation, you'll need to:

  1. Create a personal access token with repo scope
  2. Add it as a secret in your repository (e.g., FORGEJO_TOKEN)
  3. Configure the action:
- uses: 'https://code.moof.space/moof/git-scraper@main'
  with:
    url: 'https://example.com'
    forgejo_token: ${{ secrets.FORGEJO_TOKEN }}
    auto_approve_pr: 'true'  # Optional: auto-approve PRs
    auto_merge_pr: 'true'    # Optional: auto-merge approved PRs
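
Forgejo exposes Gitea-compatible `/api/v1` endpoints for pull-request reviews and merges, which is what the approve and merge steps amount to. A dry-run sketch that only prints the requests it would make (host, repository, and PR index are made-up placeholders):

```shell
#!/bin/sh
# Sketch (dry run): the API calls behind auto-approve and auto-merge.
# Nothing is sent; repo and PR index are placeholder values.
api='https://code.moof.space/api/v1'
repo='moof/example-site'   # placeholder owner/repo
index=42                   # placeholder PR index

approve_url="$api/repos/$repo/pulls/$index/reviews"
merge_url="$api/repos/$repo/pulls/$index/merge"

# With a real token, these requests would be sent as:
#   curl -X POST -H "Authorization: token $FORGEJO_TOKEN" \
#        -H 'Content-Type: application/json' -d '{"event":"APPROVE"}' "$approve_url"
#   curl -X POST -H "Authorization: token $FORGEJO_TOKEN" \
#        -H 'Content-Type: application/json' -d '{"Do":"merge"}' "$merge_url"
echo "POST $approve_url"
echo "POST $merge_url"
```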

Development

Testing Locally

  1. Install dependencies:

    apt-get update && apt-get install -y wget jq git
    
  2. Run the test suite:

    chmod +x test-scrape-site.sh
    ./test-scrape-site.sh
    

Testing in Forgejo/GitHub Actions

Create a workflow file (.forgejo/workflows/scrape.yml):

name: Scrape Website

on:
  schedule:
    - cron: '0 0 * * *'  # Run daily
  workflow_dispatch:     # Allow manual triggers

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Git Scraper
        uses: 'https://code.moof.space/moof/git-scraper@main'
        with:
          url: 'https://example.com'
          recursive: 'true'
          forgejo_token: ${{ secrets.FORGEJO_TOKEN }}
          auto_approve_pr: 'true'
          auto_merge_pr: 'true'

Building and Publishing

  1. Commit and push your changes to the main branch
  2. Create a new release/tag for versioning

Troubleshooting

  • No changes detected: Ensure the URL is accessible and contains the expected content
  • Permission denied: Verify your FORGEJO_TOKEN has the correct permissions (repo scope)
  • Large repositories: For large websites, consider increasing the workflow timeout in your repository settings

License

MIT

Author

moof