Scholscan Design
================

Article filter that learns from positive examples, then filters RSS feeds automatically. The classifier uses TF-IDF on article titles plus logistic regression: fast, with no content scraping needed.

Code Structure
--------------

    main.go - Entry point; validates commands and dispatches
    cmds/
      train.go - Load positive articles, fetch RSS as negatives, train model, output JSON
      scan.go  - Fetch articles from RSS, score with model, output filtered results
      serve.go - HTTP server with background feed refresh, embedded web UI, RSS output
    core/
      types.go     - Article struct for article data, Config struct for app settings, Command interface
      ml.go        - TF-IDF implementation with n-gram support, logistic regression classifier
      model.go     - ModelEnvelope for serialized models, model save/load functions
      scoring.go   - Score conversion from raw 0-1 to display 1-10 scale
      text.go      - HTML content extraction, word tokenization, text cleaning
      http.go      - HTTP client with retries, timeouts, user agents
      constants.go - Default timeouts, thresholds, chunk sizes

Training Flow
-------------

The train command loads positive examples from a JSONL file and reads RSS URLs from a text file (one per line; `#` comments allowed). It fetches the RSS feeds in parallel and removes any articles matching positive URLs. It then trains the TF-IDF vectorizer followed by logistic regression on a balanced dataset, finds the optimal threshold on a validation split using Youden's J statistic, and writes the complete model JSON to stdout.

Scanning Flow
-------------

The scan command fetches the specified RSS feed and scores each article with the trained model. Articles scoring above the threshold are output as JSON Lines (the same format as the input), including enrichment metadata when available. Verbose mode reports fetch and scoring progress to stderr.

Server Flow
-----------

The server loads the model and the RSS world-feed list on startup. A background goroutine refreshes all feeds in parallel every N minutes (configurable). Results are cached in memory behind an RWMutex.
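A minimal sketch of that cache, assuming illustrative names and Article fields (the real definitions live in core/types.go and may differ):

```go
package main

import (
	"sync"
	"time"
)

// Article here is a placeholder for the struct in core/types.go;
// the fields are assumptions for illustration only.
type Article struct {
	Title string
	URL   string
	Score float64
}

// feedCache guards per-feed results with an RWMutex so many HTTP
// handlers can read concurrently while the refresher writes.
type feedCache struct {
	mu      sync.RWMutex
	results map[string][]Article
	updated map[string]time.Time
}

func newFeedCache() *feedCache {
	return &feedCache{
		results: make(map[string][]Article),
		updated: make(map[string]time.Time),
	}
}

// Set is called by the background refresh goroutine after each fetch.
func (c *feedCache) Set(feedURL string, arts []Article) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.results[feedURL] = arts
	c.updated[feedURL] = time.Now()
}

// Get is called by HTTP handlers; it takes only a read lock.
func (c *feedCache) Get(feedURL string) ([]Article, time.Time) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.results[feedURL], c.updated[feedURL]
}
```

Storing the last update time alongside the results lets handlers report feed freshness without extra bookkeeping.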
HTTP handlers serve both the HTML UI and the JSON/RSS API endpoints.

API Endpoints
-------------

### HTML Pages

- GET `/` - Redirect to `/live-feed`
- GET `/live-feed` - Filtered-articles web interface (server-rendered)
- GET `/tools` - Manual article-scoring interface (server-rendered)

### JSON API

- GET `/api/filtered/feed` - Articles as a JSON array (for external consumption)
- GET `/api/health` - Health check; returns `{"status":"ok"}`
- POST `/score` - Score a single article via form post
- POST `/scan` - Scan an RSS feed via form post

### RSS Output

- GET `/api/filtered/rss` - Scored articles as an RSS feed

Model Details
-------------

The vectorizer uses unigrams plus bigrams. Minimum document frequency is 2 (removes typos); maximum document frequency is 80% (removes stopwords). The vocabulary is capped at 50,000 terms. Logistic regression uses L2 regularization with lambda=0.001, learning rate 0.5, and 500 iterations. The validation split is 80/20 with seed 42 for reproducible results. The threshold is selected using Youden's J statistic to balance false positives against false negatives.

Server Implementation
---------------------

HTML templates are embedded in the binary using embed.FS. All rendering is server-side with no JavaScript: the tools page uses standard HTML forms with POST submissions, and the live feed renders cached background results. Background refresh uses a separate goroutine per feed, and results are cached with the last update time for each feed. The RSS endpoint repackages filtered articles into RSS format for consumption by feed readers.
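The RSS repackaging can be done with the standard library's encoding/xml alone. A hypothetical sketch (struct names and the channel fields are assumptions, not the actual handler code):

```go
package main

import "encoding/xml"

// rssItem and rssFeed sketch the minimal RSS 2.0 shape needed to
// republish scored articles; real feeds often add pubDate, guid, etc.
type rssItem struct {
	Title       string `xml:"title"`
	Link        string `xml:"link"`
	Description string `xml:"description"`
}

type rssChannel struct {
	Title string    `xml:"title"`
	Link  string    `xml:"link"`
	Items []rssItem `xml:"item"`
}

type rssFeed struct {
	XMLName xml.Name   `xml:"rss"`
	Version string     `xml:"version,attr"`
	Channel rssChannel `xml:"channel"`
}

// renderRSS marshals filtered articles into an RSS 2.0 document.
func renderRSS(items []rssItem) (string, error) {
	feed := rssFeed{
		Version: "2.0",
		Channel: rssChannel{
			Title: "Scholscan filtered feed",
			Link:  "/live-feed",
			Items: items,
		},
	}
	out, err := xml.MarshalIndent(feed, "", "  ")
	if err != nil {
		return "", err
	}
	return xml.Header + string(out), nil
}
```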
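The threshold selection described under Model Details can be sketched as a sweep over candidate thresholds, keeping the one that maximizes Youden's J = TPR - FPR on the validation split. The function name and the fixed 0.01 grid are assumptions for illustration:

```go
package main

// bestThreshold sweeps thresholds over [0, 1] and returns the one
// maximizing Youden's J (true positive rate minus false positive rate).
func bestThreshold(scores []float64, labels []bool) float64 {
	best, bestJ := 0.5, -1.0
	for t := 0.0; t <= 1.0; t += 0.01 {
		var tp, fp, fn, tn float64
		for i, s := range scores {
			pred := s >= t
			switch {
			case pred && labels[i]:
				tp++
			case pred && !labels[i]:
				fp++
			case !pred && labels[i]:
				fn++
			default:
				tn++
			}
		}
		tpr, fpr := 0.0, 0.0
		if tp+fn > 0 {
			tpr = tp / (tp + fn)
		}
		if fp+tn > 0 {
			fpr = fp / (fp + tn)
		}
		if j := tpr - fpr; j > bestJ {
			bestJ, best = j, t
		}
	}
	return best
}
```

Maximizing J treats a missed positive and a spurious positive as equally costly, which suits a filter that should neither drown the reader nor silently drop relevant articles.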
Key Implementation Notes
------------------------

- Articles are processed in 50-item chunks for memory efficiency
- File paths are validated against directory-traversal attacks
- HTTP requests use a custom polite user agent with an email contact
- RSS parsing handles both RSS and Atom via the gofeed library
- The TF-IDF vectorizer stores its vocabulary as a sorted string slice for deterministic ordering
- A model version field allows future format changes
- Background refresh errors are logged but do not crash the server

External Dependencies
---------------------

gofeed (github.com/mmcdole/gofeed) for RSS/Atom parsing. All other functionality uses the Go standard library only.
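The 50-item chunking noted above can be sketched as a small generic helper; the name and callback shape are assumptions, not the code in core/:

```go
package main

// chunked invokes process on successive fixed-size slices of items,
// so only one chunk's worth of derived data is live at a time.
func chunked[T any](items []T, size int, process func([]T)) {
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		process(items[start:end])
	}
}
```

Because each chunk is a subslice of the original backing array, no copying occurs; the memory saving comes from bounding whatever per-article work process allocates.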