ScholFetch

URL → article metadata (JSONL) converter. Fetches titles only by default, for speed.

Overview

ScholFetch extracts academic article metadata from URLs. It supports arXiv, Semantic Scholar, and generic HTML sources. The output is structured JSONL, suitable for downstream processing by ScholScan (see Integration with ScholScan below).

Usage

cat urls.txt | scholfetch > articles.jsonl
# or:
cat urls.txt | scholfetch --with-content > articles.jsonl

Monitoring Progress

ScholFetch writes a structured log file scholfetch.log during processing. Monitor it in another terminal:

tail -f scholfetch.log

Semantic Scholar API key

Get higher rate limits by setting your S2 API key (not required):

export S2_API_KEY="your-key-here"
cat urls.txt | scholfetch > articles.jsonl

Get your free key at: https://www.semanticscholar.org/product/api

ScholFetch will notify you on startup whether the key is detected.

Integration with ScholScan

Once you have structured article data, pipe it to ScholScan for ML-based filtering:

# Get articles from URLs
cat urls.txt | scholfetch > articles.jsonl

# Train a classification model
scholscan train articles.jsonl --rss-feeds feeds.txt > model.json

# Score articles from an RSS feed
scholscan scan --model model.json --url "https://example.com/feed.rss" > results.jsonl

ScholFetch extracts and enriches article metadata, while ScholScan handles classification. Together they provide a complete pipeline for filtering academic literature.

Input/Output

  • Input: URLs (one per line) on stdin
  • Output: JSONL with title and url fields (stdout)
  • Add --with-content for a content field (see the example below)
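
For illustration, an output record might look like this (values are made up, not captured from a real run):

{"title": "Attention Is All You Need", "url": "https://arxiv.org/abs/1706.03762"}

With --with-content, each record also carries the extracted text in a content field:

{"title": "Attention Is All You Need", "url": "https://arxiv.org/abs/1706.03762", "content": "..."}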

How it works

URLs are routed by pattern: arXiv IDs go to the arXiv API, DOIs to Semantic Scholar, and everything else to an HTML scrape. Requests are batched in chunks of 50 for efficiency; if a batch fails, ScholFetch falls back to individual requests. Each API is rate limited separately.
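
As a rough Go sketch of the routing and fallback shape described above (all names and patterns here are hypothetical; the real routes.go and processor.go may differ):

package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Handler names the extractor a URL is routed to.
type Handler string

const (
	ArxivAPI        Handler = "arxiv" // arXiv IDs -> arXiv API
	SemanticScholar Handler = "s2"    // DOIs -> Semantic Scholar
	HTMLScrape      Handler = "html"  // everything else -> HTML scrape
)

var (
	arxivRe = regexp.MustCompile(`arxiv\.org/(?:abs|pdf)/\d{4}\.\d{4,5}`)
	doiRe   = regexp.MustCompile(`\b10\.\d{4,9}/\S+`)
)

// route applies the pattern checks in order: arXiv first, then
// DOI / Semantic Scholar, with HTML scraping as the catch-all.
func route(url string) Handler {
	switch {
	case arxivRe.MatchString(url):
		return ArxivAPI
	case doiRe.MatchString(url) || strings.Contains(url, "semanticscholar.org"):
		return SemanticScholar
	default:
		return HTMLScrape
	}
}

// processChunk shows the fallback shape: try one batched call for a
// chunk of up to 50 URLs; if it fails, retry each URL individually
// so a single bad entry cannot sink the whole chunk.
func processChunk(urls []string, batch func([]string) error, one func(string) error) {
	if err := batch(urls); err == nil {
		return
	}
	for _, u := range urls {
		_ = one(u) // per-URL errors could be logged and skipped
	}
}

func main() {
	for _, u := range []string{
		"https://arxiv.org/abs/1706.03762",
		"https://doi.org/10.1000/xyz123",
		"https://example.com/blog/some-paper",
	} {
		fmt.Println(u, "->", route(u))
	}
}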

Code

  • main.go - reads stdin, sets up flags/output
  • routes.go - determines which handler (arxiv/s2/html) to use for each URL
  • processor.go - batching, fallback logic
  • arxiv.go, scholar.go, html.go - the actual extractors
  • client.go - HTTP client with retries and rate limiting (sketched below)
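
For flavor, here is a minimal sketch of the retry-plus-rate-limiting idea behind client.go. It is an illustration under assumed names, not the actual implementation:

package main

import (
	"fmt"
	"net/http"
	"time"
)

// rateLimitedClient serializes requests through a ticker so an API
// sees at most one request per interval, and retries transient
// failures (network errors, 429s, 5xx) with linear backoff.
type rateLimitedClient struct {
	tick *time.Ticker
}

func newRateLimitedClient(interval time.Duration) *rateLimitedClient {
	return &rateLimitedClient{tick: time.NewTicker(interval)}
}

func (c *rateLimitedClient) get(url string, retries int) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt <= retries; attempt++ {
		<-c.tick.C // wait for the next rate-limit slot
		resp, err := http.Get(url)
		switch {
		case err != nil:
			lastErr = err
		case resp.StatusCode == 429 || resp.StatusCode >= 500:
			resp.Body.Close()
			lastErr = fmt.Errorf("transient status %d", resp.StatusCode)
		default:
			return resp, nil // success, or a non-retryable client error
		}
		time.Sleep(time.Duration(attempt+1) * time.Second) // linear backoff
	}
	return nil, lastErr
}

func main() {
	c := newRateLimitedClient(500 * time.Millisecond)
	resp, err := c.get("https://export.arxiv.org/api/query?id_list=1706.03762", 3)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}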

Build and Development

just build
just test

Roadmap

Future work could integrate Crossref and PubMed with little effort, especially given the title-only approach.