| author | Sam Scholten | 2025-12-15 19:34:17 +1000 |
|---|---|---|
| committer | Sam Scholten | 2025-12-15 19:34:59 +1000 |
| commit | 9f5978186ac3de07f4325975fecf4f538fe713b6 (patch) | |
| tree | 41440b703054fe59eb561ba81d80fd60380c1f7a /DESIGN.md | |
Init v0.1.0
Diffstat (limited to 'DESIGN.md')
| -rw-r--r-- | DESIGN.md | 81 |
1 files changed, 81 insertions, 0 deletions
diff --git a/DESIGN.md b/DESIGN.md
new file mode 100644
index 0000000..dba3394
--- /dev/null
+++ b/DESIGN.md
@@ -0,0 +1,81 @@
+Scholscan Design
+=================
+
+Article filter that learns from positive examples then filters RSS feeds automatically. Classifier uses TF-IDF on article titles plus logistic regression - fast, no content scraping needed.
+
+Code Structure
+---------------
+
+main.go - Entry point, validates commands, dispatches
+
+cmds/
+  train.go - Load positive articles, fetch RSS as negatives, train model, output JSON
+  scan.go - Fetch articles from RSS, score with model, output filtered results
+  serve.go - HTTP server with background feed refresh, embedded web UI, RSS output
+
+core/
+  types.go - Article struct holds article data, Config struct for app settings, Command interface
+  ml.go - TF-IDF implementation with n-gram support, logistic regression classifier
+  model.go - ModelEnvelope for serialized models, model save/load functions
+  scoring.go - Score conversion from raw 0-1 to display 1-10 scale
+  text.go - HTML content extraction, word tokenization, text cleaning
+  http.go - HTTP client with retries, timeouts, user agents
+  constants.go - Default timeouts, thresholds, chunk sizes
+
+Training Flow
+-------------
+
+Command loads positive examples from JSONL file. Reads RSS URLs from text file (one per line, # comments allowed). Fetches RSS feeds in parallel, removes any articles matching positive URLs. Trains TF-IDF vectorizer then logistic regression on balanced dataset. Finds optimal threshold on validation split using Youden's J metric. Outputs complete model JSON to stdout.
+
+Scanning Flow
+-------------
+
+Command fetches specified RSS feed, scores each article using trained model. Articles scoring above threshold output as JSON-Lines (same format as input). Includes enrichment metadata if available. Verbose mode shows fetch and scoring progress to stderr.
+
+Server Flow
+-----------
+
+Server loads model and RSS world feed list on startup. Background goroutine refreshes all feeds in parallel every N minutes (configurable). Results cached in memory with RWMutex. HTTP handlers serve both HTML UI and JSON/RSS API endpoints.
+
+API Endpoints
+-------------
+
+### HTML Pages
+- GET `/` - Redirect to /live-feed
+- GET `/live-feed` - Filtered articles web interface (server-rendered)
+- GET `/tools` - Manual article scoring interface (server-rendered)
+
+### HTTP Handlers
+- GET `/api/filtered/feed` - Articles as JSON array (for external consumption)
+- GET `/api/health` - Health check returns {"status":"ok"}
+- POST `/score` - Score single article via form post
+- POST `/scan` - Scan RSS feed via form post
+
+### RSS Output
+- GET `/api/filtered/rss` - Scored articles as RSS feed
+
+Model Details
+-------------
+
+Vectorizer uses unigrams plus bigrams. Minimum document frequency 2 (removes typos), maximum 80% (removes stopwords). Vocabulary capped at 50000 terms. Logistic regression with L2 regularization lambda=0.001, learning rate 0.5, 500 iterations. Validation split 80/20 with seed 42 for reproducible results. Threshold selected using Youden's J to balance false positives against false negatives.
+
+Server Implementation
+---------------------
+
+HTML templates embedded in binary using embed.FS. All rendering is server-side with no JavaScript. Tools page uses standard HTML forms with POST submissions. Live feed displays cached background results with server-side rendering. Background refresh uses separate goroutine per feed. Results cached with last update time for each feed. RSS output repackages filtered articles into RSS format for consumption.
+
+Key Implementation Notes
+------------------------
+
+- Articles processed in 50-item chunks for memory efficiency
+- File paths validated against directory traversal attacks
+- HTTP requests use custom polite user agent with email contact
+- RSS parsing handles both RSS and Atom via gofeed library
+- TF-IDF vectorizer stores vocabulary as sorted string array for deterministic ordering
+- Model version field allows future format changes
+- Background refresh errors logged but don't crash server
+
+External Dependencies
+---------------------
+
+gofeed mmcdole for RSS/Atom parsing. All other functionality uses Go standard library only.
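
The threshold step named under Training Flow and Model Details (Youden's J on the validation split) can be sketched roughly as below. This is an illustration only, not the code in core/ml.go: the function name `youdenThreshold`, the choice of observed scores as candidate cut-offs, and the 0.5 fallback are all assumptions. J is taken as TPR - FPR, which equals sensitivity + specificity - 1.

```go
package main

import (
	"fmt"
	"sort"
)

// youdenThreshold (hypothetical name) returns the cut-off that maximises
// Youden's J = TPR - FPR on a validation set. An article counts as a
// predicted positive when its score is >= the threshold.
func youdenThreshold(scores []float64, labels []bool) (threshold, bestJ float64) {
	if len(scores) != len(labels) {
		return 0.5, 0 // assumed fallback for malformed input
	}
	var pos, neg int
	for _, isPos := range labels {
		if isPos {
			pos++
		} else {
			neg++
		}
	}
	if pos == 0 || neg == 0 {
		return 0.5, 0 // J is undefined without both classes
	}

	// Candidate thresholds: every distinct observed score, ascending.
	cands := append([]float64(nil), scores...)
	sort.Float64s(cands)

	threshold, bestJ = 0.5, -1
	for i, t := range cands {
		if i > 0 && t == cands[i-1] {
			continue // skip duplicate candidates
		}
		var tp, fp int
		for k, s := range scores {
			if s >= t {
				if labels[k] {
					tp++
				} else {
					fp++
				}
			}
		}
		j := float64(tp)/float64(pos) - float64(fp)/float64(neg)
		if j > bestJ { // strict '>' keeps the lowest threshold on ties
			threshold, bestJ = t, j
		}
	}
	return threshold, bestJ
}

func main() {
	scores := []float64{0.2, 0.3, 0.7, 0.9}
	labels := []bool{false, false, true, true}
	t, j := youdenThreshold(scores, labels)
	fmt.Printf("threshold=%.2f J=%.2f\n", t, j) // prints "threshold=0.70 J=1.00"
}
```

The quadratic scan keeps the sketch short; a single pass over the sorted scores would give the same result on larger validation sets.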
