ScholFetch
URL → Article metadata (JSONL) converter. Fetches title-only by default for speed.
Overview
ScholFetch extracts academic article metadata from URLs. It supports arXiv, Semantic Scholar, and generic HTML sources. The tool outputs structured JSONL suitable for downstream processing by ScholScan (see below).
Usage
cat urls.txt | scholfetch > articles.jsonl
# or:
cat urls.txt | scholfetch --with-content > articles.jsonl
Monitoring Progress
ScholFetch writes a structured log file scholfetch.log during processing. Monitor it in another terminal:
tail -f scholfetch.log
Semantic Scholar API key
Get higher rate limits by setting your S2 API key (not required):
export S2_API_KEY="your-key-here"
cat urls.txt | scholfetch > articles.jsonl
Get your free key at: https://www.semanticscholar.org/product/api
ScholFetch will notify you on startup whether the key is detected.
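As a rough illustration of how an optional key like this is typically wired in, here is a minimal Go sketch. It assumes the key is sent in the x-api-key header, as described in the public Semantic Scholar API documentation. The function newS2Request, the example URL, and the printed messages are hypothetical and are not taken from the ScholFetch source.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// newS2Request builds a GET request and attaches the API key when one is given.
// Hypothetical helper for illustration; ScholFetch's own client may differ.
func newS2Request(url, key string) (*http.Request, bool, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, false, err
	}
	if key == "" {
		// No key: the request is still valid, just subject to lower rate limits.
		return req, false, nil
	}
	req.Header.Set("x-api-key", key)
	return req, true, nil
}

func main() {
	key := os.Getenv("S2_API_KEY")

	// Illustrative URL only: the request is constructed here but never sent.
	_, hasKey, err := newS2Request("https://api.semanticscholar.org/graph/v1/paper/search?query=metadata", key)
	if err != nil {
		fmt.Fprintln(os.Stderr, "building request:", err)
		os.Exit(1)
	}

	// Made-up startup messages, standing in for whatever ScholFetch prints.
	if hasKey {
		fmt.Println("S2 API key detected")
	} else {
		fmt.Println("no S2 API key set; using unauthenticated rate limits")
	}
}
```

Passing the key in as a parameter rather than reading the environment inside the helper keeps the function easy to test.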
Integration with ScholScan
Once you have structured article data, pipe it to ScholScan for ML-based filtering:
# Get articles from URLs
cat urls.txt | scholfetch > articles.jsonl
# Train a classification model
scholscan train articles.jsonl --rss-feeds feeds.txt > model.json
# Score articles from an RSS feed
scholscan scan --model model.json --url "https://example.com/feed.rss" > results.jsonl
ScholFetch extracts and enriches article metadata, while ScholScan handles classification. Together they provide a complete pipeline for filtering academic literature.
Input/Output
- Input: URLs (one per line) on stdin
- Output: JSONL with title and url fields (stdout)
- Add --with-content for a content field
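For downstream consumers, the output can be read one line at a time. The Go sketch below assumes each line is a JSON object with the title, url, and optional content fields listed above. The Article type and the decodeLine and parseArticles helpers are hypothetical names chosen for this example.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// Article mirrors the fields named in this README; content is only
// present when ScholFetch was run with --with-content.
type Article struct {
	Title   string `json:"title"`
	URL     string `json:"url"`
	Content string `json:"content,omitempty"`
}

// decodeLine parses a single JSONL line into an Article.
func decodeLine(line []byte) (Article, error) {
	var a Article
	err := json.Unmarshal(line, &a)
	return a, err
}

// parseArticles reads JSONL from r, skipping blank lines.
func parseArticles(r io.Reader) ([]Article, error) {
	var articles []Article
	scanner := bufio.NewScanner(r)
	// Raise the line limit: content fields can exceed the 64 KiB default.
	scanner.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	for scanner.Scan() {
		line := bytes.TrimSpace(scanner.Bytes())
		if len(line) == 0 {
			continue
		}
		a, err := decodeLine(line)
		if err != nil {
			return nil, fmt.Errorf("bad JSONL line: %w", err)
		}
		articles = append(articles, a)
	}
	return articles, scanner.Err()
}

func main() {
	articles, err := parseArticles(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, a := range articles {
		fmt.Printf("%s\t%s\n", a.Title, a.URL)
	}
}
```

Saved as a standalone program, this could be used as: cat articles.jsonl | go run read.go (the file name read.go is only an example).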
How it works
URLs are routed by pattern: arXiv IDs go to the arXiv API, DOIs go to Semantic Scholar, and everything else falls back to an HTML scrape. Requests are batched in chunks of 50 for efficiency. If a batch fails, ScholFetch falls back to individual requests. Rate limits are applied per API.
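The routing and batch-with-fallback flow described above can be sketched in Go roughly as follows. This is an illustration under assumptions, not the code in routes.go or processor.go: the regular expressions, the handler labels, and the function names route, chunk, and fetchWithFallback are all hypothetical, and the fetchers are passed in as plain functions so that no network access is needed.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

const batchSize = 50 // chunk size stated in this README

// Assumed patterns: a new-style arXiv ID in an arxiv.org URL, and a DOI.
var (
	arxivRe = regexp.MustCompile(`arxiv\.org/(?:abs|pdf)/\d{4}\.\d{4,5}`)
	doiRe   = regexp.MustCompile(`10\.\d{4,9}/\S+`)
)

// errBatchFailed simulates a failed batch request in the demo below.
var errBatchFailed = errors.New("simulated batch failure")

// route picks a handler label for a URL: arXiv first, then DOI, else HTML.
func route(url string) string {
	lower := strings.ToLower(url)
	switch {
	case arxivRe.MatchString(lower):
		return "arxiv"
	case doiRe.MatchString(lower):
		return "s2"
	default:
		return "html"
	}
}

// chunk splits items into slices of at most size elements.
func chunk(items []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for len(items) > 0 {
		n := size
		if len(items) < n {
			n = len(items)
		}
		out = append(out, items[:n:n])
		items = items[n:]
	}
	return out
}

// fetchWithFallback tries the whole batch first; if that fails, it retries
// each URL on its own and keeps whatever succeeds.
func fetchWithFallback(
	batch []string,
	fetchBatch func([]string) ([]string, error),
	fetchOne func(string) (string, error),
) []string {
	if titles, err := fetchBatch(batch); err == nil {
		return titles
	}
	var out []string
	for _, u := range batch {
		if t, err := fetchOne(u); err == nil {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	urls := []string{
		"https://arxiv.org/abs/2301.00001",
		"https://doi.org/10.1038/nature12373",
		"https://example.com/blog/post",
	}
	for _, u := range urls {
		fmt.Printf("%-5s %s\n", route(u), u)
	}

	// Stand-in fetchers: the batch call always fails, forcing the fallback.
	failing := func([]string) ([]string, error) { return nil, errBatchFailed }
	single := func(u string) (string, error) { return "title for " + u, nil }
	for _, b := range chunk(urls, batchSize) {
		fmt.Println(fetchWithFallback(b, failing, single))
	}
}
```

Checking the arXiv pattern before the DOI pattern matters in this sketch, since a URL could in principle match both.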
Code
- main.go - reads stdin, sets up flags/output
- routes.go - determines which handler (arxiv/s2/html) for each URL
- processor.go - batching, fallback logic
- arxiv.go, scholar.go, html.go - the actual extractors
- client.go - HTTP client with retries and rate limiting
Build and Development
just build
just test
Roadmap
Future work could integrate Crossref and PubMed fairly easily, especially for the title-only approach.
