| | |
|---|---|
| author | Sam Scholten, 2025-12-15 19:35:46 +1000 |
| committer | Sam Scholten, 2025-12-15 19:35:57 +1000 |
| commit | 3562d2fd34bb98d29c7cf6e4d4130129a7bb24f2 (patch) |
| tree | 42b1f0e0a346a1cf087df90e29a100edbd66b3eb /README.md |
Diffstat (limited to 'README.md')
| mode | file | lines |
|---|---|---|
| -rw-r--r-- | README.md | 83 |
1 file changed, 83 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9190dc1
--- /dev/null
+++ b/README.md
@@ -0,0 +1,83 @@

# ScholFetch

URL → Article metadata (JSONL) converter. Fetches title-only by default for speed.

## Overview

ScholFetch extracts academic article metadata from URLs. It supports arXiv, Semantic Scholar, and generic HTML sources. The tool outputs structured JSONL suitable for downstream processing by ScholScan (see below).

## Usage

```bash
cat urls.txt | scholfetch > articles.jsonl
# or, to also fetch article content:
cat urls.txt | scholfetch --with-content > articles.jsonl
```

## Monitoring Progress

ScholFetch writes a structured log file, `scholfetch.log`, during processing. Monitor it from another terminal:

```bash
tail -f scholfetch.log
```

## Semantic Scholar API key

Get higher rate limits by setting your S2 API key (*not required*):

```bash
export S2_API_KEY="your-key-here"
cat urls.txt | scholfetch > articles.jsonl
```

Get your free key at: https://www.semanticscholar.org/product/api

ScholFetch reports on startup whether the key was detected.

## Integration with ScholScan

Once you have structured article data, pipe it to [ScholScan](https://git.samsci.com/scholscan) for ML-based filtering:

```bash
# Get articles from URLs
cat urls.txt | scholfetch > articles.jsonl

# Train a classification model
scholscan train articles.jsonl --rss-feeds feeds.txt > model.json

# Score articles from an RSS feed
scholscan scan --model model.json --url "https://example.com/feed.rss" > results.jsonl
```

ScholFetch extracts and enriches article metadata, while ScholScan handles classification. Together they form a complete pipeline for filtering academic literature.

## Input/Output

- Input: URLs (one per line) on stdin
- Output: JSONL with `title` and `url` fields on stdout (see the example output at the end of this README)
- Add `--with-content` to include a `content` field

## How it works

URLs are routed by pattern: arXiv IDs go to the arXiv API, DOIs to Semantic Scholar, and everything else to a generic HTML scrape. Requests are batched in chunks of 50 for efficiency; if a batch fails, ScholFetch falls back to individual requests. Each API is rate limited separately. Sketches of the routing and fallback logic appear at the end of this README.

## Code

- `main.go` - reads stdin, sets up flags/output
- `routes.go` - determines which handler (arxiv/s2/html) each URL goes to
- `processor.go` - batching and fallback logic
- `arxiv.go`, `scholar.go`, `html.go` - the actual extractors
- `client.go` - HTTP client with retries and rate limiting

## Build and Development

```bash
just build
just test
```

## Roadmap

Future work could integrate Crossref and PubMed fairly easily, especially with the title-only approach.
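## Example output

For reference, this is roughly what the JSONL described under Input/Output looks like. The field values here are illustrative, not real output; per the sections above, each record carries only `title` and `url` by default, and `--with-content` adds a `content` field:

```jsonl
{"title": "Attention Is All You Need", "url": "https://arxiv.org/abs/1706.03762"}
{"title": "Some scraped article", "url": "https://example.com/blog/post", "content": "Full text..."}
```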
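## Appendix: sketches of the internals

The routing described under "How it works" could look roughly like this Go sketch. All names and regex patterns here are hypothetical illustrations; the real logic lives in `routes.go`:

```go
package main

import (
	"fmt"
	"regexp"
)

// Handler names the extractor a URL gets routed to.
type Handler string

const (
	HandlerArxiv Handler = "arxiv" // arXiv API
	HandlerS2    Handler = "s2"    // Semantic Scholar API
	HandlerHTML  Handler = "html"  // generic HTML scrape
)

var (
	// arXiv abs/pdf URLs with a modern arXiv ID, e.g. arxiv.org/abs/1706.03762.
	arxivRe = regexp.MustCompile(`arxiv\.org/(?:abs|pdf)/\d{4}\.\d{4,5}`)
	// DOIs anywhere in the URL, e.g. doi.org/10.1000/xyz123.
	doiRe = regexp.MustCompile(`\b10\.\d{4,9}/\S+`)
)

// route picks a handler by pattern; anything unrecognized falls back to
// generic HTML scraping.
func route(url string) Handler {
	switch {
	case arxivRe.MatchString(url):
		return HandlerArxiv
	case doiRe.MatchString(url):
		return HandlerS2
	default:
		return HandlerHTML
	}
}

func main() {
	for _, u := range []string{
		"https://arxiv.org/abs/1706.03762",
		"https://doi.org/10.1000/xyz123",
		"https://example.com/blog/some-article",
	} {
		fmt.Printf("%-45s -> %s\n", u, route(u))
	}
}
```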
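The chunks-of-50 batching with per-URL fallback, also from "How it works", might look like the following. Again a sketch under assumed signatures, not the actual implementation in `processor.go`; the stub fetchers in `main` exist only to exercise the fallback path:

```go
package main

import (
	"errors"
	"fmt"
)

const batchSize = 50 // chunk size from "How it works"

// processAll walks urls in chunks of batchSize. If a whole batch fails,
// every URL in that chunk is retried individually, so one bad URL can't
// take down its neighbours in the same chunk.
func processAll(urls []string, fetchBatch func([]string) error, fetchOne func(string) error) {
	for start := 0; start < len(urls); start += batchSize {
		end := start + batchSize
		if end > len(urls) {
			end = len(urls)
		}
		batch := urls[start:end]
		if err := fetchBatch(batch); err == nil {
			continue
		}
		// Batch failed: fall back to one request per URL.
		fmt.Printf("batch failed, retrying %d URLs individually\n", len(batch))
		for _, u := range batch {
			if err := fetchOne(u); err != nil {
				fmt.Println("skipping:", u, err)
			}
		}
	}
}

func main() {
	urls := make([]string, 0, 120)
	for i := 0; i < 120; i++ {
		urls = append(urls, fmt.Sprintf("https://example.com/a/%d", i))
	}
	// Stub fetchers: every batch call fails so the fallback path runs.
	fetchBatch := func([]string) error { return errors.New("upstream 500") }
	fetchOne := func(string) error { return nil }
	processAll(urls, fetchBatch, fetchOne)
}
```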
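## Building without just

The Justfile recipes themselves are not shown in this README. Assuming a conventional Go module layout (an assumption; the actual recipes may differ), the plain-toolchain equivalents would presumably be:

```bash
go build ./...
go test ./...
```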
