A git-scraping action for Forgejo
Git Scraper Action
A Forgejo/GitHub Action that scrapes websites and automatically manages changes through pull requests.
Features
- Recursively download websites with wget
- Configurable include/exclude patterns for files and directories
- Automatic change detection for both modified and new files
- Automated pull request creation with optional auto-approval and merging
- Support for custom user agent and other wget options
- Comprehensive test suite included
- Runs in a containerized environment with minimal dependencies
Usage
Basic Usage
    name: Scrape Website
    on:
      schedule:
        - cron: '0 0 * * *' # Run daily
      workflow_dispatch: # Allow manual triggers
    jobs:
      scrape:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Run Git Scraper
            uses: 'https://code.moof.space/moof/git-scraper@main'
            with:
              url: 'https://example.com'
              recursive: 'true'
              no_parent: 'true'
              recurse_limit: '2'
Advanced Usage
    - uses: 'https://code.moof.space/moof/git-scraper@main'
      with:
        url: 'https://example.com'
        recursive: 'true'
        no_parent: 'true'
        recurse_limit: '3'
        include_files: '*.html,*.css,*.js'
        exclude_files: '*.tmp,*.log'
        include_dirs: 'blog,docs'
        exclude_dirs: 'admin,private'
Inputs
| Input | Required | Default | Description |
|---|---|---|---|
| url | Yes | - | URL of the website to scrape |
| recursive | No | false | Enable recursive download |
| no_parent | No | true | Do not ascend to the parent directory |
| recurse_limit | No | 5 | Maximum recursion depth |
| include_files | No | - | Comma-separated list of file patterns to include |
| exclude_files | No | - | Comma-separated list of file patterns to exclude |
| include_dirs | No | - | Comma-separated list of directory patterns to include |
| exclude_dirs | No | - | Comma-separated list of directory patterns to exclude |
| forgejo_token | No | github.token | Token for PR automation (requires repo scope) |
| auto_approve_pr | No | false | Automatically approve created PRs |
| auto_merge_pr | No | false | Automatically merge created PRs (requires auto_approve_pr: true) |
| git_user_name | No | Forgejo Action Bot | Git author name for commits |
| git_user_email | No | actions@localhost | Git author email for commits |
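As a rough illustration of how these inputs could translate into a wget invocation, the sketch below maps each input to a wget option. It is not taken from scrape-site.sh: the function name `build_wget_args` and the upper-case variable names are hypothetical, and the defaults simply mirror the table above. The flags themselves (`--recursive`, `--level`, `--no-parent`, `--accept`, `--reject`, `--include-directories`, `--exclude-directories`) are standard wget options.

```shell
#!/bin/sh
# Hypothetical helper: print the wget arguments implied by the inputs,
# one argument per line. Variable names are illustrative only.
build_wget_args() {
    if [ "${RECURSIVE:-false}" = "true" ]; then
        printf '%s\n' "--recursive" "--level=${RECURSE_LIMIT:-5}"
    fi
    if [ "${NO_PARENT:-true}" = "true" ]; then
        printf '%s\n' "--no-parent"
    fi
    # Comma-separated lists are passed through unchanged; wget accepts
    # comma-separated values for all four options.
    if [ -n "${INCLUDE_FILES:-}" ]; then
        printf '%s\n' "--accept=$INCLUDE_FILES"
    fi
    if [ -n "${EXCLUDE_FILES:-}" ]; then
        printf '%s\n' "--reject=$EXCLUDE_FILES"
    fi
    if [ -n "${INCLUDE_DIRS:-}" ]; then
        printf '%s\n' "--include-directories=$INCLUDE_DIRS"
    fi
    if [ -n "${EXCLUDE_DIRS:-}" ]; then
        printf '%s\n' "--exclude-directories=$EXCLUDE_DIRS"
    fi
    # The URL goes last.
    printf '%s\n' "$URL"
}
```

Printing one argument per line keeps patterns such as `*.html` from being expanded by the shell; a caller could read the lines back into `"$@"` before invoking wget.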
Change Detection
The action detects the following types of changes:
- New files downloaded by wget
- Modified files (content changes)
- Deleted files
- File renames
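One way to distinguish these change types is to stage everything and read Git's name-status output with rename detection enabled. The sketch below is an assumption about how this could be done, not the action's actual code; `describe_changes` is a hypothetical helper that labels each line produced by `git diff --cached --name-status -M`.

```shell
#!/bin/sh
# Hypothetical helper: label each line of `git diff --name-status -M`.
# Input lines are tab-separated: STATUS<TAB>PATH[<TAB>NEW_PATH].
describe_changes() {
    while IFS="$(printf '\t')" read -r status path new_path; do
        case "$status" in
            A)  echo "new: $path" ;;
            M)  echo "modified: $path" ;;
            D)  echo "deleted: $path" ;;
            # Renames carry a similarity score, e.g. R100.
            R*) echo "renamed: $path -> $new_path" ;;
            *)  echo "other ($status): $path" ;;
        esac
    done
}

# Usage inside the repository, after the download step:
#   git add -A
#   git diff --cached --name-status -M | describe_changes
```

Staging first matters because untracked files do not appear in a plain `git diff`; once staged, new files show up with status `A`.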
When changes are detected, the action will:
- Create a new branch
- Commit all changes (including new files)
- Push the branch
- Create a pull request (if forgejo_token is provided)
- Optionally approve and merge the PR (if enabled)
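The steps above can be sketched in shell as follows. This is a minimal sketch under stated assumptions, not the contents of scrape-site.sh: the branch naming scheme, the commit message, the PR title, and the function names are all hypothetical. The request targets the Forgejo `POST /api/v1/repos/{owner}/{repo}/pulls` endpoint, and jq is used only to build the JSON body.

```shell
#!/bin/sh
# Hypothetical naming scheme: scrape/<UTC timestamp>.
make_branch_name() {
    printf 'scrape/%s\n' "$(date -u +%Y%m%d-%H%M%S)"
}

# Build the pull-request endpoint from a server URL and an owner/repo slug.
pulls_endpoint() {
    printf '%s/api/v1/repos/%s/pulls\n' "${1%/}" "$2"
}

# Commit everything on a new branch, push it, and open a PR.
# Arguments: server URL, owner/repo, base branch, token (may be empty).
publish_changes() {
    branch=$(make_branch_name)
    git checkout -b "$branch" &&
        git add -A &&
        git commit -m "Update scraped content" &&
        git push origin "$branch" || return 1

    # Without a token, stop after pushing the branch.
    [ -n "$4" ] || return 0

    # jq builds the JSON body so that values are escaped correctly.
    jq -n --arg head "$branch" --arg base "$3" \
        --arg title "Update scraped content" \
        '{head: $head, base: $base, title: $title}' |
        curl --fail --silent --show-error -X POST \
            -H "Authorization: token $4" \
            -H "Content-Type: application/json" \
            --data @- "$(pulls_endpoint "$1" "$2")"
}
```

A call might look like `publish_changes "$GITHUB_SERVER_URL" "$GITHUB_REPOSITORY" main "$FORGEJO_TOKEN"`, assuming the runner exposes those two environment variables.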
PR Automation
To enable PR automation, you'll need to:
- Create a personal access token with repo scope
- Add it as a secret in your repository (e.g., FORGEJO_TOKEN)
- Configure the action:

      - uses: 'https://code.moof.space/moof/git-scraper@main'
        with:
          url: 'https://example.com'
          forgejo_token: ${{ secrets.FORGEJO_TOKEN }}
          auto_approve_pr: 'true' # Optional: auto-approve PRs
          auto_merge_pr: 'true' # Optional: auto-merge approved PRs
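To make the relationship between the two flags concrete, the sketch below shows one way the approve and merge steps could be sequenced against the Forgejo API. It is an illustration only: the function names are hypothetical, the merge style (`merge`) is an assumed choice, and the action's real script may behave differently. The endpoints used are `POST .../pulls/{index}/reviews`, `POST .../pulls/{index}/merge`, and `GET .../pulls/{index}`.

```shell
#!/bin/sh
# Decide which steps to run from the auto_approve_pr and auto_merge_pr
# inputs. Merging is only planned when approval is also enabled.
plan_pr_steps() {
    auto_approve=$1 auto_merge=$2
    [ "$auto_approve" = "true" ] || return 0
    if [ "$auto_merge" = "true" ]; then
        echo "approve merge"
    else
        echo "approve"
    fi
}

# Run the planned steps for one PR.
# Arguments: repo API base (https://host/api/v1/repos/owner/repo),
#            PR index, token, auto_approve_pr, auto_merge_pr.
approve_and_merge() {
    api=$1 index=$2 token=$3
    for step in $(plan_pr_steps "$4" "$5"); do
        case "$step" in
            approve) path="reviews"; body='{"event":"APPROVED"}' ;;
            merge)   path="merge";   body='{"Do":"merge"}' ;;
        esac
        curl --fail --silent --show-error -X POST \
            -H "Authorization: token $token" \
            -H "Content-Type: application/json" \
            --data "$body" "$api/pulls/$index/$path" || return 1
    done
}

# Succeeds when the PR reports merged=true.
# Arguments: repo API base, PR index, token.
is_merged() {
    curl --fail --silent --show-error \
        -H "Authorization: token $3" "$1/pulls/$2" |
        jq -e '.merged == true' >/dev/null
}
```

Checking the PR state afterwards with `is_merged` gives a definite answer even when the merge response body is empty or in an unexpected format.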
Development
Testing Locally
- Install dependencies:

      apt-get update && apt-get install -y wget jq git

- Run the test suite:

      chmod +x test-scrape-site.sh
      ./test-scrape-site.sh
Testing in Forgejo/GitHub Actions
Create a workflow file (.forgejo/workflows/scrape.yml):
    name: Scrape Website
    on:
      schedule:
        - cron: '0 0 * * *' # Run daily
      workflow_dispatch: # Allow manual triggers
    jobs:
      scrape:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Run Git Scraper
            uses: 'https://code.moof.space/moof/git-scraper@main'
            with:
              url: 'https://example.com'
              recursive: 'true'
              forgejo_token: ${{ secrets.FORGEJO_TOKEN }}
              auto_approve_pr: 'true'
              auto_merge_pr: 'true'
Building and Publishing
- Commit and push your changes to the main branch
- Create a new release/tag for versioning
Troubleshooting
- No changes detected: Ensure the URL is accessible and contains the expected content
- Permission denied: Verify your FORGEJO_TOKEN has the correct permissions (repo scope)
- Large repositories: For large websites, consider increasing the workflow timeout in your repository settings
License
MIT