# ScholFetch

URL → Article metadata (JSONL) converter. Fetches title-only by default for speed.

## Overview

ScholFetch extracts academic article metadata from URLs. It supports arXiv, Semantic Scholar, and generic HTML sources. The tool outputs structured JSONL suitable for downstream processing by ScholScan (see below).

## Usage

```bash
cat urls.txt | scholfetch > articles.jsonl
# or, to include article content:
cat urls.txt | scholfetch --with-content > articles.jsonl
```

## Monitoring Progress

ScholFetch writes a structured log file, `scholfetch.log`, during processing. Monitor it in another terminal:

```bash
tail -f scholfetch.log
```

## Semantic Scholar API key

Get higher rate limits by setting your S2 API key (*not required*):

```bash
export S2_API_KEY="your-key-here"
cat urls.txt | scholfetch > articles.jsonl
```

Get your free key at: https://www.semanticscholar.org/product/api

ScholFetch reports on startup whether the key was detected.

## Integration with ScholScan

Once you have structured article data, pipe it to [ScholScan](https://git.samsci.com/scholscan) for ML-based filtering:

```bash
# Get articles from URLs
cat urls.txt | scholfetch > articles.jsonl

# Train a classification model
scholscan train articles.jsonl --rss-feeds feeds.txt > model.json

# Score articles from an RSS feed
scholscan scan --model model.json --url "https://example.com/feed.rss" > results.jsonl
```

ScholFetch extracts and enriches article metadata, while ScholScan handles classification. Together they form a complete pipeline for filtering academic literature.

## Input/Output

- Input: URLs, one per line, on stdin
- Output: JSONL with `title` and `url` fields on stdout (an example record appears at the end of this README)
- Add `--with-content` to include a `content` field

## How it works

URLs are routed by pattern: arXiv IDs go to the arXiv API, DOIs go to Semantic Scholar, and everything else is scraped as HTML. Requests are batched in chunks of 50 for efficiency; if a batch fails, ScholFetch falls back to individual requests. Each API is rate limited separately. A routing sketch appears at the end of this README.

## Code

- `main.go` - reads stdin, sets up flags and output
- `routes.go` - determines which handler (arXiv/S2/HTML) to use for each URL
- `processor.go` - batching and fallback logic
- `arxiv.go`, `scholar.go`, `html.go` - the actual extractors
- `client.go` - HTTP client with retries and rate limiting

## Build and Development

```bash
just build
just test
```

## Roadmap

Future work could integrate Crossref and PubMed fairly easily, especially for the title-only approach.
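## Example output

The field names below (`title`, `url`) come from the Input/Output section above; the specific article, the key order, and the exact whitespace are illustrative assumptions rather than guaranteed output.

```bash
echo "https://arxiv.org/abs/1706.03762" | scholfetch
# expected shape of the result, one JSON object per line:
# {"title":"Attention Is All You Need","url":"https://arxiv.org/abs/1706.03762"}
```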
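## Routing sketch

A minimal sketch of the routing rule described in "How it works". It is not the actual `routes.go` code; the regular expressions, handler labels, and function names here are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Handler labels for the three extraction paths (hypothetical names).
const (
	handleArxiv = "arxiv"
	handleS2    = "semantic-scholar"
	handleHTML  = "html"
)

var (
	// arXiv abstract/PDF URLs such as https://arxiv.org/abs/1706.03762
	arxivPattern = regexp.MustCompile(`arxiv\.org/(abs|pdf)/\d{4}\.\d{4,5}`)
	// Bare or embedded DOIs such as 10.1000/xyz123
	doiPattern = regexp.MustCompile(`\b10\.\d{4,9}/\S+`)
)

// route picks a handler by pattern: arXiv IDs go to the arXiv API, DOIs go to
// Semantic Scholar, and everything else falls back to an HTML scrape.
func route(url string) string {
	switch {
	case arxivPattern.MatchString(url):
		return handleArxiv
	case doiPattern.MatchString(url):
		return handleS2
	default:
		return handleHTML
	}
}

func main() {
	for _, u := range []string{
		"https://arxiv.org/abs/1706.03762",
		"https://doi.org/10.1000/xyz123",
		"https://example.com/blog/some-article",
	} {
		fmt.Printf("%-45s -> %s\n", u, route(u))
	}
}
```

In the real tool, batching (chunks of 50), per-API rate limiting, and the individual-request fallback sit on top of this routing decision, as described above.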