author     Sam Scholten  2025-12-15 19:35:46 +1000
committer  Sam Scholten  2025-12-15 19:35:57 +1000
commit     3562d2fd34bb98d29c7cf6e4d4130129a7bb24f2 (patch)
tree       42b1f0e0a346a1cf087df90e29a100edbd66b3eb /README.md
Init v0.1.0 (HEAD, main)
Diffstat (limited to 'README.md')
-rw-r--r--  README.md  83
1 file changed, 83 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9190dc1
--- /dev/null
+++ b/README.md
@@ -0,0 +1,83 @@
+# ScholFetch
+
+URL → article metadata (JSONL) converter. Fetches titles only by default, for speed.
+
+## Overview
+
+ScholFetch extracts academic article metadata from URLs.
+It supports arXiv, Semantic Scholar, and generic HTML sources.
+The tool outputs structured JSONL suitable for downstream processing by ScholScan (see below).
+
+## Usage
+```bash
+cat urls.txt | scholfetch > articles.jsonl
+# or, to also fetch article content:
+cat urls.txt | scholfetch --with-content > articles.jsonl
+```
+
+## Monitoring Progress
+
+ScholFetch writes a structured log file `scholfetch.log` during processing. Monitor it in another terminal:
+
+```bash
+tail -f scholfetch.log
+```
+
+## Semantic Scholar API key
+
+Get higher rate limits by setting your S2 API key (*not required*):
+
+```bash
+export S2_API_KEY="your-key-here"
+cat urls.txt | scholfetch > articles.jsonl
+```
+
+Get your free key at: https://www.semanticscholar.org/product/api
+
+ScholFetch reports on startup whether a key was detected.
+
+## Integration with ScholScan
+
+Once you have structured article data, pipe it to [ScholScan](https://git.samsci.com/scholscan) for ML-based filtering:
+
+```bash
+# Get articles from URLs
+cat urls.txt | scholfetch > articles.jsonl
+
+# Train a classification model
+scholscan train articles.jsonl --rss-feeds feeds.txt > model.json
+
+# Score articles from an RSS feed
+scholscan scan --model model.json --url "https://example.com/feed.rss" > results.jsonl
+```
+
+ScholFetch extracts and enriches article metadata, while ScholScan handles classification. Together they provide a complete pipeline for filtering academic literature.
+
+## Input/Output
+- Input: URLs (one per line) on stdin
+- Output: JSONL with `title` and `url` fields on stdout (see the example below)
+- Add `--with-content` to include a `content` field
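+
+For example, a single output line in title-only mode might look like this (values are illustrative):
+
+```json
+{"title": "Attention Is All You Need", "url": "https://arxiv.org/abs/1706.03762"}
+```
+
+With `--with-content`, each object also carries a `content` field holding the extracted text.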
+
+## How it works
+
+URLs are routed by pattern: arXiv IDs go to the arXiv API, DOIs to Semantic Scholar, and everything else to a generic HTML scrape.
+Requests are batched in chunks of 50 for efficiency; if a batch fails, ScholFetch falls back to individual requests. Each API is rate limited separately.
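+
+As a rough illustration of the routing step, here is a minimal, self-contained Go sketch; the type names, function names, and URL patterns are assumptions for this example and do not reflect the actual identifiers in `routes.go`:
+
+```go
+package main
+
+import (
+	"fmt"
+	"regexp"
+	"strings"
+)
+
+// handler names here are illustrative only.
+type handler string
+
+const (
+	arxivAPI        handler = "arxiv"
+	semanticScholar handler = "s2"
+	htmlScrape      handler = "html"
+)
+
+var (
+	arxivLink = regexp.MustCompile(`arxiv\.org/(abs|pdf)/`)
+	doiLike   = regexp.MustCompile(`\b10\.\d{4,9}/`)
+)
+
+// route picks a handler for one URL: arXiv links go to the arXiv API,
+// DOI-bearing links to Semantic Scholar, everything else to the HTML scraper.
+func route(url string) handler {
+	switch {
+	case arxivLink.MatchString(url):
+		return arxivAPI
+	case strings.Contains(url, "doi.org/") || doiLike.MatchString(url):
+		return semanticScholar
+	default:
+		return htmlScrape
+	}
+}
+
+func main() {
+	for _, u := range []string{
+		"https://arxiv.org/abs/1706.03762",
+		"https://doi.org/10.1038/nature14539",
+		"https://example.com/blog/some-paper",
+	} {
+		fmt.Println(u, "->", route(u))
+	}
+}
+```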
+
+## Code
+
+- `main.go` - reads stdin, sets up flags/output
+- `routes.go` - picks the handler (arXiv / Semantic Scholar / HTML) for each URL
+- `processor.go` - batching and fallback logic (see the sketch after this list)
+- `arxiv.go`, `scholar.go`, `html.go` - the actual extractors
+- `client.go` - HTTP client with retries and rate limiting
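+
+The batching-with-fallback behaviour lives in `processor.go`. A simplified sketch of the idea, with hypothetical function signatures that do not reflect the real code:
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+)
+
+// processAll splits urls into chunks of 50, tries each chunk as one batch
+// request, and retries one URL at a time when the batch call fails.
+func processAll(urls []string, batch func([]string) error, single func(string) error) {
+	const chunkSize = 50
+	for start := 0; start < len(urls); start += chunkSize {
+		end := start + chunkSize
+		if end > len(urls) {
+			end = len(urls)
+		}
+		chunk := urls[start:end]
+		if err := batch(chunk); err != nil {
+			// Batch failed: fall back to individual requests.
+			for _, u := range chunk {
+				_ = single(u)
+			}
+		}
+	}
+}
+
+func main() {
+	urls := []string{"https://example.com/a", "https://example.com/b"}
+	batch := func(chunk []string) error { return errors.New("simulated batch failure") }
+	single := func(u string) error { fmt.Println("fetching individually:", u); return nil }
+	processAll(urls, batch, single)
+}
+```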
+
+## Build and Development
+
+```bash
+just build
+just test
+```
+
+## Roadmap
+
+Future work could integrate Crossref and PubMed fairly easily, especially for the title-only approach.