From 9f5978186ac3de07f4325975fecf4f538fe713b6 Mon Sep 17 00:00:00 2001
From: Sam Scholten
Date: Mon, 15 Dec 2025 19:34:17 +1000
Subject: Init v0.1.0

---
 DESIGN.md | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100644 DESIGN.md

(limited to 'DESIGN.md')

diff --git a/DESIGN.md b/DESIGN.md
new file mode 100644
index 0000000..dba3394
--- /dev/null
+++ b/DESIGN.md
@@ -0,0 +1,81 @@
+Scholscan Design
+=================
+
+Article filter that learns from positive examples then filters RSS feeds automatically. Classifier uses TF-IDF on article titles plus logistic regression - fast, no content scraping needed.
+
+Code Structure
+---------------
+
+main.go - Entry point, validates commands, dispatches
+
+cmds/
+  train.go - Load positive articles, fetch RSS as negatives, train model, output JSON
+  scan.go - Fetch articles from RSS, score with model, output filtered results
+  serve.go - HTTP server with background feed refresh, embedded web UI, RSS output
+
+core/
+  types.go - Article struct holds article data, Config struct for app settings, Command interface
+  ml.go - TF-IDF implementation with n-gram support, logistic regression classifier
+  model.go - ModelEnvelope for serialized models, model save/load functions
+  scoring.go - Score conversion from raw 0-1 to display 1-10 scale
+  text.go - HTML content extraction, word tokenization, text cleaning
+  http.go - HTTP client with retries, timeouts, user agents
+  constants.go - Default timeouts, thresholds, chunk sizes
+
+Training Flow
+-------------
+
+Command loads positive examples from JSONL file. Reads RSS URLs from text file (one per line, # comments allowed). Fetches RSS feeds in parallel, removes any articles matching positive URLs. Trains TF-IDF vectorizer then logistic regression on balanced dataset. Finds optimal threshold on validation split using Youden's J metric. Outputs complete model JSON to stdout.
+
+Scanning Flow
+-------------
+
+Command fetches specified RSS feed, scores each article using trained model. Articles scoring above threshold output as JSON-Lines (same format as input). Includes enrichment metadata if available. Verbose mode shows fetch and scoring progress to stderr.
+
+Server Flow
+-----------
+
+Server loads model and RSS world feed list on startup. Background goroutine refreshes all feeds in parallel every N minutes (configurable). Results cached in memory with RWMutex. HTTP handlers serve both HTML UI and JSON/RSS API endpoints.
+
+API Endpoints
+-------------
+
+### HTML Pages
+- GET `/` - Redirect to /live-feed
+- GET `/live-feed` - Filtered articles web interface (server-rendered)
+- GET `/tools` - Manual article scoring interface (server-rendered)
+
+### HTTP Handlers
+- GET `/api/filtered/feed` - Articles as JSON array (for external consumption)
+- GET `/api/health` - Health check returns {"status":"ok"}
+- POST `/score` - Score single article via form post
+- POST `/scan` - Scan RSS feed via form post
+
+### RSS Output
+- GET `/api/filtered/rss` - Scored articles as RSS feed
+
+Model Details
+-------------
+
+Vectorizer uses unigrams plus bigrams. Minimum document frequency 2 (removes typos), maximum 80% (removes stopwords). Vocabulary capped at 50000 terms. Logistic regression with L2 regularization lambda=0.001, learning rate 0.5, 500 iterations. Validation split 80/20 with seed 42 for reproducible results. Threshold selected using Youden's J to balance false positives against false negatives.
+
+Server Implementation
+---------------------
+
+HTML templates embedded in binary using embed.FS. All rendering is server-side with no JavaScript. Tools page uses standard HTML forms with POST submissions. Live feed displays cached background results with server-side rendering. Background refresh uses separate goroutine per feed. Results cached with last update time for each feed. RSS output repackages filtered articles into RSS format for consumption.
+
+Key Implementation Notes
+------------------------
+
+- Articles processed in 50-item chunks for memory efficiency
+- File paths validated against directory traversal attacks
+- HTTP requests use custom polite user agent with email contact
+- RSS parsing handles both RSS and Atom via gofeed library
+- TF-IDF vectorizer stores vocabulary as sorted string array for deterministic ordering
+- Model version field allows future format changes
+- Background refresh errors logged but don't crash server
+
+External Dependencies
+---------------------
+
+gofeed mmcdole for RSS/Atom parsing. All other functionality uses Go standard library only.
--
cgit v1.2.3
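
Note on threshold selection: DESIGN.md says the threshold is chosen on the validation split with Youden's J but does not show the computation. A minimal sketch follows, assuming J = TPR - FPR and that an article counts as positive when its score is at or above the cutoff. The name `bestThreshold`, the 0.5 fallback, and the sample data are illustrative and are not taken from the patch.

```go
package main

import (
	"fmt"
	"sort"
)

// bestThreshold returns the score cutoff that maximizes Youden's J
// (J = true positive rate - false positive rate) on validation data.
// An item is predicted positive when score >= threshold.
// Hypothetical helper: the patch does not show the real implementation.
func bestThreshold(scores []float64, labels []bool) (threshold, bestJ float64) {
	var pos, neg int
	for _, isPos := range labels {
		if isPos {
			pos++
		} else {
			neg++
		}
	}
	if pos == 0 || neg == 0 {
		// J is undefined without both classes; fall back to 0.5 (assumption).
		return 0.5, 0
	}

	// Every observed score is a candidate cutoff.
	candidates := append([]float64(nil), scores...)
	sort.Float64s(candidates)

	threshold, bestJ = 0.5, -1.0
	for _, cut := range candidates {
		var tp, fp int
		for i, s := range scores {
			if s >= cut {
				if labels[i] {
					tp++
				} else {
					fp++
				}
			}
		}
		j := float64(tp)/float64(pos) - float64(fp)/float64(neg)
		if j > bestJ { // strict comparison keeps the lowest cutoff on ties
			threshold, bestJ = cut, j
		}
	}
	return threshold, bestJ
}

func main() {
	// Made-up validation scores and labels, for illustration only.
	scores := []float64{0.1, 0.2, 0.35, 0.5, 0.6, 0.8, 0.9}
	labels := []bool{false, false, true, false, true, true, true}
	thr, j := bestThreshold(scores, labels)
	fmt.Printf("threshold=%.2f J=%.2f\n", thr, j) // prints "threshold=0.60 J=0.75"
}
```

The scan is quadratic in the number of validation items, which is fine for a sketch. A single pass over the sorted scores gives the same result for larger splits.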
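
Note on the server cache: the Server Flow and Server Implementation sections describe one goroutine per feed, results cached behind an RWMutex with a last update time, and refresh errors that are logged without crashing the server. The sketch below shows one way those pieces could fit together. All type and function names (`feedCache`, `refreshAll`, `fakeFetch`) and the trimmed `Article` struct are assumptions for illustration, not the code in cmds/serve.go.

```go
package main

import (
	"fmt"
	"log"
	"sync"
	"time"
)

// Article is a trimmed, hypothetical stand-in for the struct in core/types.go.
type Article struct {
	Title string
	URL   string
}

// cacheEntry keeps one feed's filtered articles and when they were refreshed.
type cacheEntry struct {
	articles []Article
	updated  time.Time
}

// feedCache guards per-feed results with an RWMutex so many HTTP handlers
// can read concurrently while the refresher occasionally writes.
type feedCache struct {
	mu      sync.RWMutex
	entries map[string]cacheEntry
}

func newFeedCache() *feedCache {
	return &feedCache{entries: make(map[string]cacheEntry)}
}

// Store replaces the cached results for one feed under the write lock.
func (c *feedCache) Store(feed string, articles []Article) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[feed] = cacheEntry{articles: articles, updated: time.Now()}
}

// Snapshot copies all cached articles under the read lock, so callers
// never hold a reference into the cache's own slices.
func (c *feedCache) Snapshot() []Article {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var out []Article
	for _, e := range c.entries {
		out = append(out, e.articles...)
	}
	return out
}

// refreshAll fetches every feed in its own goroutine. A failed fetch is
// logged and skipped, leaving any previously cached entry in place.
func refreshAll(c *feedCache, feeds []string, fetch func(string) ([]Article, error)) {
	var wg sync.WaitGroup
	for _, feed := range feeds {
		wg.Add(1)
		go func(feed string) {
			defer wg.Done()
			articles, err := fetch(feed)
			if err != nil {
				log.Printf("refresh %s: %v", feed, err)
				return
			}
			c.Store(feed, articles)
		}(feed)
	}
	wg.Wait()
}

// fakeFetch simulates fetching and scoring a feed; "bad" always fails.
func fakeFetch(feed string) ([]Article, error) {
	if feed == "bad" {
		return nil, fmt.Errorf("simulated fetch failure")
	}
	return []Article{{Title: "Example from " + feed, URL: feed + "/1"}}, nil
}

func main() {
	cache := newFeedCache()
	refreshAll(cache, []string{"feed-a", "bad", "feed-b"}, fakeFetch)
	fmt.Println(len(cache.Snapshot()), "articles cached") // prints "2 articles cached"
}
```

To refresh every N minutes, `refreshAll` could be called from a loop driven by a `time.Ticker` in a background goroutine.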