author Sam Scholten 2025-12-15 19:34:17 +1000
committer Sam Scholten 2025-12-15 19:34:59 +1000
commit 9f5978186ac3de07f4325975fecf4f538fe713b6 (patch)
tree 41440b703054fe59eb561ba81d80fd60380c1f7a /DESIGN.md
Init v0.1.0
Diffstat (limited to 'DESIGN.md')
-rw-r--r-- DESIGN.md 81
1 files changed, 81 insertions, 0 deletions
diff --git a/DESIGN.md b/DESIGN.md
new file mode 100644
index 0000000..dba3394
--- /dev/null
+++ b/DESIGN.md
@@ -0,0 +1,81 @@
+Scholscan Design
+=================
+
+Article filter that learns from positive examples then filters RSS feeds automatically. Classifier uses TF-IDF on article titles plus logistic regression - fast, no content scraping needed.
+
+Code Structure
+---------------
+
+main.go - Entry point, validates commands, dispatches
+
+cmds/
+ train.go - Load positive articles, fetch RSS as negatives, train model, output JSON
+ scan.go - Fetch articles from RSS, score with model, output filtered results
+ serve.go - HTTP server with background feed refresh, embedded web UI, RSS output
+
+core/
+ types.go - Article struct holds article data, Config struct for app settings, Command interface
+ ml.go - TF-IDF implementation with n-gram support, logistic regression classifier
+ model.go - ModelEnvelope for serialized models, model save/load functions
+ scoring.go - Score conversion from raw 0-1 to display 1-10 scale
+ text.go - HTML content extraction, word tokenization, text cleaning
+ http.go - HTTP client with retries, timeouts, user agents
+ constants.go - Default timeouts, thresholds, chunk sizes
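
To illustrate the n-gram support listed for ml.go, the sketch below turns a title into unigram and bigram terms. The function names `tokenize` and `ngrams` and the split-on-non-alphanumeric rule are assumptions made for this example; they are not taken from the scholscan source.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases s and splits it on every rune that is not a letter or digit.
// This tokenization rule is an assumption for illustration.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// ngrams returns all unigrams followed by all bigrams,
// where a bigram is two adjacent tokens joined by a single space.
func ngrams(tokens []string) []string {
	out := make([]string, 0, 2*len(tokens))
	out = append(out, tokens...)
	for i := 0; i+1 < len(tokens); i++ {
		out = append(out, tokens[i]+" "+tokens[i+1])
	}
	return out
}

func main() {
	for _, term := range ngrams(tokenize("Quantum Sensing with NV-Centers")) {
		fmt.Println(term)
	}
}
```

Each resulting term would then be counted per document to build the TF-IDF vocabulary.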
+
+Training Flow
+-------------
+
+Command loads positive examples from JSONL file. Reads RSS URLs from text file (one per line, # comments allowed). Fetches RSS feeds in parallel and drops any fetched article whose URL matches a positive example; the remaining feed articles serve as negatives. Trains TF-IDF vectorizer then logistic regression on balanced dataset. Finds optimal threshold on validation split using Youden's J statistic. Outputs complete model JSON to stdout.
+
+Scanning Flow
+-------------
+
+Command fetches specified RSS feed and scores each article using trained model. Articles scoring above threshold are written as JSON Lines (same format as input). Includes enrichment metadata if available. Verbose mode writes fetch and scoring progress to stderr.
+
+Server Flow
+-----------
+
+Server loads model and RSS world feed list on startup. Background goroutine refreshes all feeds in parallel every N minutes (configurable). Results cached in memory with RWMutex. HTTP handlers serve both HTML UI and JSON/RSS API endpoints.
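
A rough sketch of the caching pattern described above: one goroutine per feed writes into a map guarded by an RWMutex, and readers take the read lock. All names here (`feedCache`, `refreshAll`, the `fetch` callback) and the use of plain title strings as the cached payload are assumptions for illustration, not the actual server code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// feedCache holds the latest results per feed URL, guarded by an RWMutex.
type feedCache struct {
	mu      sync.RWMutex
	items   map[string][]string  // feed URL -> article titles (placeholder payload)
	updated map[string]time.Time // feed URL -> time of last successful refresh
}

func newFeedCache() *feedCache {
	return &feedCache{
		items:   make(map[string][]string),
		updated: make(map[string]time.Time),
	}
}

// set stores results for one feed under the write lock.
func (c *feedCache) set(url string, titles []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[url] = titles
	c.updated[url] = time.Now()
}

// get returns cached results under the read lock, so readers do not block each other.
func (c *feedCache) get(url string) ([]string, time.Time, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	titles, ok := c.items[url]
	return titles, c.updated[url], ok
}

// refreshAll fetches every feed in its own goroutine and waits for all of them.
// A failed fetch is reported and the previous cache entry is left in place.
func refreshAll(c *feedCache, urls []string, fetch func(string) ([]string, error)) {
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			titles, err := fetch(u)
			if err != nil {
				fmt.Println("refresh failed:", u, err)
				return
			}
			c.set(u, titles)
		}(u)
	}
	wg.Wait()
}

func main() {
	c := newFeedCache()
	fetch := func(u string) ([]string, error) { return []string{"title from " + u}, nil }
	refreshAll(c, []string{"https://example.org/a.xml"}, fetch)
	titles, _, ok := c.get("https://example.org/a.xml")
	fmt.Println(titles, ok)
}
```

In a long-running server, `refreshAll` would typically be called from a background goroutine driven by a `time.Ticker` set to the configured interval.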
+
+API Endpoints
+-------------
+
+### HTML Pages
+- GET `/` - Redirect to `/live-feed`
+- GET `/live-feed` - Filtered articles web interface (server-rendered)
+- GET `/tools` - Manual article scoring interface (server-rendered)
+
+### HTTP Handlers
+- GET `/api/filtered/feed` - Articles as JSON array (for external consumption)
+- GET `/api/health` - Health check returns `{"status":"ok"}`
+- POST `/score` - Score single article via form post
+- POST `/scan` - Scan RSS feed via form post
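
As an example of how one of these handlers might look with only the standard library, here is a sketch of the health check. The names `healthHandler`, `newMux`, and `probe` are hypothetical, and the explicit method check is an assumption about how non-GET requests are treated.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthHandler answers GET requests with {"status":"ok"}.
// Rejecting other methods with 405 is an assumption for this sketch.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	body, err := json.Marshal(map[string]string{"status": "ok"})
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(body)
}

// newMux registers the health route on a fresh ServeMux.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/health", healthHandler)
	return mux
}

// probe sends an in-memory request to h and returns the status code and body.
func probe(h http.Handler, method, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe(newMux(), http.MethodGet, "/api/health")
	fmt.Println(code, body)
}
```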
+
+### RSS Output
+- GET `/api/filtered/rss` - Scored articles as RSS feed
+
+Model Details
+-------------
+
+Vectorizer uses unigrams plus bigrams. Terms must appear in at least 2 documents (drops typos and one-off terms) and in at most 80% of documents (drops stopword-like terms). Vocabulary capped at 50000 terms. Logistic regression with L2 regularization lambda=0.001, learning rate 0.5, 500 iterations. Train/validation split 80/20 with seed 42 for reproducible results. Threshold selected by maximizing Youden's J (true positive rate minus false positive rate) to balance false positives against false negatives.
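
Threshold selection with Youden's J can be sketched as below: every observed score is tried as a cutoff and the one with the largest J = TPR - FPR wins. The function name `youdenThreshold`, the `score >= threshold` convention, the first-wins tie-break, and the 0.5 fallback for degenerate input are all assumptions of this sketch, not confirmed details of scholscan.

```go
package main

import (
	"fmt"
	"sort"
)

// youdenThreshold returns the cutoff that maximizes J = TPR - FPR,
// treating score >= threshold as a positive prediction.
// Candidate cutoffs are the observed scores in ascending order; on ties the lowest wins.
// If the input is mismatched or contains only one class, J is undefined and
// the function falls back to (0.5, 0), which is an assumption of this sketch.
func youdenThreshold(scores []float64, labels []bool) (threshold, bestJ float64) {
	if len(scores) != len(labels) {
		return 0.5, 0
	}
	pos, neg := 0, 0
	for _, l := range labels {
		if l {
			pos++
		} else {
			neg++
		}
	}
	if pos == 0 || neg == 0 {
		return 0.5, 0
	}
	cands := append([]float64(nil), scores...)
	sort.Float64s(cands)
	bestJ = -1
	for _, t := range cands {
		tp, fp := 0, 0
		for i, s := range scores {
			if s >= t {
				if labels[i] {
					tp++
				} else {
					fp++
				}
			}
		}
		j := float64(tp)/float64(pos) - float64(fp)/float64(neg)
		if j > bestJ {
			bestJ, threshold = j, t
		}
	}
	return threshold, bestJ
}

func main() {
	scores := []float64{0.2, 0.3, 0.7, 0.9}
	labels := []bool{false, false, true, true}
	t, j := youdenThreshold(scores, labels)
	fmt.Println(t, j)
}
```

This version is quadratic in the number of validation examples, which keeps it short; a single pass over sorted scores would do the same work faster.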
+
+Server Implementation
+---------------------
+
+HTML templates embedded in binary using embed.FS. All rendering is server-side with no JavaScript. Tools page uses standard HTML forms with POST submissions. Live feed displays cached background results with server-side rendering. Background refresh uses separate goroutine per feed. Results cached with last update time for each feed. RSS output repackages filtered articles into RSS format for consumption.
+
+Key Implementation Notes
+------------------------
+
+- Articles processed in 50-item chunks for memory efficiency
+- File paths validated against directory traversal attacks
+- HTTP requests use custom polite user agent with email contact
+- RSS parsing handles both RSS and Atom via gofeed library
+- TF-IDF vectorizer stores vocabulary as sorted string array for deterministic ordering
+- Model version field allows future format changes
+- Background refresh errors logged but don't crash server
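
The chunked processing from the first note above might look like this. The constant name `chunkSize`, the helper `forEachChunk`, and the use of a plain string slice in place of articles are assumptions for illustration.

```go
package main

import "fmt"

// chunkSize matches the 50-item figure in the notes; the name is hypothetical.
const chunkSize = 50

// forEachChunk calls fn on consecutive sub-slices of at most size items.
// The sub-slices share the backing array of items, so nothing is copied.
func forEachChunk(items []string, size int, fn func(chunk []string)) {
	if size <= 0 {
		return
	}
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		fn(items[start:end])
	}
}

func main() {
	items := make([]string, 120)
	forEachChunk(items, chunkSize, func(chunk []string) {
		fmt.Println(len(chunk)) // prints 50, 50, 20 on separate lines
	})
}
```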
+
+External Dependencies
+---------------------
+
+gofeed (github.com/mmcdole/gofeed) for RSS/Atom parsing. All other functionality uses Go standard library only.