# ScholFetch

URL → article metadata (JSONL) converter. Fetches titles only by default for speed.

## Overview

ScholFetch extracts academic article metadata from URLs.
It supports arXiv, Semantic Scholar, and generic HTML sources.
The tool outputs structured JSONL suitable for downstream processing by ScholScan (see below).

## Usage
```bash
cat urls.txt | scholfetch > articles.jsonl
# or, to include full article content:
cat urls.txt | scholfetch --with-content > articles.jsonl
```

## Monitoring Progress

ScholFetch writes a structured log file `scholfetch.log` during processing. Monitor it in another terminal:

```bash
tail -f scholfetch.log
```

## Semantic Scholar API key

Get higher rate limits by setting your S2 API key (*not required*):

```bash
export S2_API_KEY="your-key-here"
cat urls.txt | scholfetch > articles.jsonl
```

Get your free key at: https://www.semanticscholar.org/product/api

ScholFetch reports on startup whether a key was detected.
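
For reference, here is a minimal sketch of how a client could read the key from the environment and attach it to Semantic Scholar requests. The `newS2Request` helper is hypothetical (not the actual `client.go` code), and it assumes the standard Semantic Scholar `x-api-key` header:

```go
// Sketch: read S2_API_KEY from the environment and, if set, attach it to a
// Semantic Scholar request. newS2Request is a hypothetical helper, not the
// real client.go wiring.
package main

import (
	"fmt"
	"net/http"
	"os"
)

func newS2Request(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if key := os.Getenv("S2_API_KEY"); key != "" {
		req.Header.Set("x-api-key", key)
	}
	return req, nil
}

func main() {
	req, err := newS2Request("https://api.semanticscholar.org/graph/v1/paper/arXiv:1706.03762?fields=title")
	if err != nil {
		panic(err)
	}
	fmt.Println("x-api-key set:", req.Header.Get("x-api-key") != "")
}
```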

## Integration with ScholScan

Once you have structured article data, pipe it to [ScholScan](https://git.samsci.com/scholscan) for ML-based filtering:

```bash
# Get articles from URLs
cat urls.txt | scholfetch > articles.jsonl

# Train a classification model
scholscan train articles.jsonl --rss-feeds feeds.txt > model.json

# Score articles from an RSS feed
scholscan scan --model model.json --url "https://example.com/feed.rss" > results.jsonl
```

ScholFetch extracts and enriches article metadata, while ScholScan handles classification. Together they provide a complete pipeline for filtering academic literature.

## Input/Output
- Input: URLs (one per line) on stdin
- Output: JSONL with `title` and `url` fields (stdout)
- Add `--with-content` to also include a `content` field (see the example below)
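
For illustration, output lines might look like the following (titles and URLs here are just examples; the first line shows the default title-only output, the second shows what `--with-content` adds):

```json
{"title": "Attention Is All You Need", "url": "https://arxiv.org/abs/1706.03762"}
{"title": "Example Article", "url": "https://example.com/post", "content": "Full text extracted from the page..."}
```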

## How it works

URLs are routed by pattern (arXiv IDs → arXiv API, DOIs → Semantic Scholar, everything else → HTML scrape); see the sketch below.
Requests are batched in chunks of 50 for efficiency; if a batch fails, ScholFetch falls back to individual requests. Each API is rate limited separately.
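
As an illustration of the routing rule (not the exact `routes.go` code; handler names and regex patterns here are made up):

```go
// Sketch of the routing described above: arXiv IDs → arXiv API,
// DOIs → Semantic Scholar, everything else → HTML scrape.
package main

import (
	"fmt"
	"regexp"
)

type handler string

const (
	handleArxiv handler = "arxiv"
	handleS2    handler = "s2"
	handleHTML  handler = "html"
)

var (
	arxivRe = regexp.MustCompile(`arxiv\.org/(abs|pdf)/\d{4}\.\d{4,5}`)
	doiRe   = regexp.MustCompile(`10\.\d{4,9}/\S+`)
)

// route picks an extractor for a URL by pattern matching.
func route(url string) handler {
	switch {
	case arxivRe.MatchString(url):
		return handleArxiv
	case doiRe.MatchString(url):
		return handleS2
	default:
		return handleHTML
	}
}

func main() {
	for _, u := range []string{
		"https://arxiv.org/abs/1706.03762",
		"https://doi.org/10.1038/nature14539",
		"https://example.com/blog/post",
	} {
		fmt.Println(u, "→", route(u))
	}
}
```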

## Code

- `main.go` - reads stdin, sets up flags/output
- `routes.go` - determines which handler (arxiv/s2/html) to use for each URL
- `processor.go` - batching and fallback logic (see the sketch after this list)
- `arxiv.go`, `scholar.go`, `html.go` - the actual extractors
- `client.go` - HTTP client with retries and rate limiting
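
For reference, a minimal sketch of the batch-then-fallback flow that `processor.go` implements (batch size taken from the section above; `fetchBatch`, `fetchOne`, and the error handling are hypothetical stand-ins for the real extractors):

```go
// Sketch of the chunk-of-50 batching with per-URL fallback described above.
package main

import (
	"errors"
	"log"
)

const batchSize = 50

func process(urls []string,
	fetchBatch func([]string) ([]string, error),
	fetchOne func(string) (string, error)) []string {
	var out []string
	for start := 0; start < len(urls); start += batchSize {
		end := start + batchSize
		if end > len(urls) {
			end = len(urls)
		}
		chunk := urls[start:end]

		// Try the whole chunk in one batched request first.
		records, err := fetchBatch(chunk)
		if err == nil {
			out = append(out, records...)
			continue
		}

		// Batch failed: fall back to one request per URL.
		for _, u := range chunk {
			rec, err := fetchOne(u)
			if err != nil {
				log.Printf("skipping %s: %v", u, err)
				continue
			}
			out = append(out, rec)
		}
	}
	return out
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b"}
	out := process(urls,
		func([]string) ([]string, error) { return nil, errors.New("batch API down") }, // force the fallback path
		func(u string) (string, error) { return `{"title":"stub","url":"` + u + `"}`, nil },
	)
	log.Println(out)
}
```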

## Build and Development

```bash
just build
just test
```

## Roadmap

Future work could add Crossref and PubMed as additional sources with relatively little effort, especially for the title-only approach.