Scraper Pipeline
OpenTech-DB includes an automated data acquisition pipeline that searches academic databases and grey-literature sources for energy technology parameters, extracts structured values, and queues them as candidates for admin review before merging into the main catalogue.
Architecture Overview
┌──────────────────────────────────────────────────────────────────────┐
│                        ScrapingPipeline.run()                        │
│                                                                      │
│  ┌────────────────┐    ┌─────────────────┐    ┌─────────────────┐    │
│  │ Sources        │───▶│ Extractors      │───▶│ Normalizer      │    │
│  │ (scrapers/)    │    │ text / PDF      │    │ → candidates    │    │
│  │ OpenAlex       │    │ LLM (optional)  │    │ flat schema     │    │
│  │ Sem. Scholar   │    └─────────────────┘    └────────┬────────┘    │
│  │ NREL ATB       │                                    │             │
│  │ Crossref       │              ┌─────────────────────▼─────────┐   │
│  │ arXiv …        │              │ Storage                       │   │
│  └────────────────┘              │ Supabase (primary)            │   │
│                                  │ File fallback (data/scraped/) │   │
│                                  └───────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────────┘
The pipeline is orchestrated by APScheduler and runs automatically twice per month (1st and 15th at 02:00 UTC). It can also be triggered manually via the admin API or CLI.
Components
ScrapingPipeline (scrapers/pipeline.py)
Central orchestrator that:
- Loads ScraperConfig from scraper_config.yaml.
- Instantiates enabled source scrapers.
- For each enabled technology × source combination, fetches PaperRecord objects.
- Passes papers through TextExtractor (and optionally PDFExtractor + LLMExtractor).
- Runs Normalizer to build candidate instances.
- Writes candidates to Storage.
- Emits live run events consumed by the /scraper/status endpoint.
Starting a run:
from scrapers.pipeline import ScrapingPipeline
pipeline = ScrapingPipeline.from_config()
# Full run — all sources × all technologies
result = pipeline.run()
# Selective run — only specified technologies
result = pipeline.run(tech_ids=["ccgt", "solar_pv_utility"])
# Selective run — only specified sources
result = pipeline.run(sources=["open_alex", "nrel_atb"])
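The orchestration steps listed above amount to a nested loop over technologies and sources. The following is a minimal sketch of that loop, not the project's implementation: `Paper`, `RunResult`, and `run_pipeline` are hypothetical stand-ins, sources are modelled as plain callables, and the return type of the real `run()` is not documented here.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Paper:
    """Hypothetical stand-in for PaperRecord."""
    title: str
    abstract: str


@dataclass
class RunResult:
    """Hypothetical run summary (the real result type is an assumption)."""
    candidates: list = field(default_factory=list)
    events: list = field(default_factory=list)


def run_pipeline(
    sources: dict[str, Callable[[str], list[Paper]]],
    tech_ids: list[str],
    extract: Callable[[Paper], dict],
    only_sources: Optional[list[str]] = None,
) -> RunResult:
    """Visit every technology x source pair and collect candidates."""
    result = RunResult()
    for tech_id in tech_ids:
        for name, search in sources.items():
            if only_sources is not None and name not in only_sources:
                continue  # selective run: skip sources that were not requested
            result.events.append(f"start {tech_id}/{name}")
            for paper in search(tech_id):
                params = extract(paper)
                if params:  # papers with no extracted values yield no candidate
                    result.candidates.append({
                        "technology_id": tech_id,
                        "source": name,
                        "paper_title": paper.title,
                        "extracted_params": params,
                    })
    return result


# Usage with fake sources and a trivial extractor.
sources = {
    "open_alex": lambda tech: [Paper("CCGT costs", "capex of 870 USD/kW")],
    "nrel_atb": lambda tech: [],
}
extract = lambda paper: {"capex": 870.0} if "870" in paper.abstract else {}
result = run_pipeline(
    sources, ["ccgt", "solar_pv_utility"], extract, only_sources=["open_alex"]
)
print(len(result.candidates), result.events)
```

Passing the sources and the extractor in as callables keeps the loop testable without network access, which is one plausible reason to structure an orchestrator this way.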
BaseScraper (scrapers/base.py)
Abstract base class shared by all source scrapers. Provides:
- HTTP client: httpx.AsyncClient with configurable timeout_seconds and max_retries.
- Rate limiting: configurable rate_limit_delay (default 1.5 s) between requests.
- Disk cache: optional HTTP response cache with cache_ttl_hours (default 24 h). Avoids redundant API calls on repeated runs.
- Exponential back-off: on HTTP 429 / 503 responses.
- Robots.txt compliance: respects crawl delays.
- Polite User-Agent: includes contact_email from config (required for the OpenAlex polite pool, 10 req/s instead of 1 req/s).
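To make the back-off behaviour concrete, here is a minimal sketch of retrying on HTTP 429 / 503 with doubling delays. It is an illustration under assumptions: the doubling factor, the function names `backoff_delays` and `fetch_with_backoff`, and the injected `send` / `sleep` callables are hypothetical, and the real BaseScraper may compute its delays differently. The defaults mirror max_retries and retry_backoff_seconds from the configuration reference.

```python
import time
from typing import Callable

RETRYABLE_STATUSES = {429, 503}


def backoff_delays(max_retries: int = 3, base_seconds: float = 2.0) -> list[float]:
    """Delay before each retry: base, 2 * base, 4 * base, ... (assumed doubling)."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]


def fetch_with_backoff(
    send: Callable[[], int],
    max_retries: int = 3,
    base_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-retryable status or retries run out.

    `send` is a hypothetical callable that performs one request and returns
    the HTTP status code; `sleep` is injectable so tests do not have to wait.
    """
    status = send()
    for delay in backoff_delays(max_retries, base_seconds):
        if status not in RETRYABLE_STATUSES:
            break
        sleep(delay)
        status = send()
    return status
```

With the defaults, a request that keeps failing waits 2 s, 4 s, and 8 s before giving up and returning the last status.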
Sources (scrapers/sources/)
Each source scraper inherits from BaseScraper and implements a search(technology_id, queries) method returning a list of PaperRecord objects.
| Source file | Source | Key details |
|---|---|---|
| open_alex.py | OpenAlex | 250 M+ scholarly works; free; polite pool 10 req/s |
| semantic_scholar.py | Semantic Scholar | AI-powered paper search; 200 M+ papers; free |
| scopus_api.py | Elsevier Scopus | Premium; requires SCOPUS_API_KEY + institutional token |
| scholarly_gs.py | Google Scholar | Web scraping via scholarly; fragile; disabled by default |
| nrel_atb.py | NREL ATB | Annual Technology Baseline; authoritative US cost data |
| crossref.py | Crossref | DOI metadata; fast free API |
| arxiv_source.py | arXiv | OAI-PMH + REST; preprints in energy & engineering |
| europe_pmc.py | Europe PMC | European biomedical + energy papers |
Enabling/disabling sources — edit scraper_config.yaml:
sources:
  open_alex:
    enabled: true
    max_results_per_tech: 20
    lookback_months: 12
  scopus:
    enabled: false   # disabled: requires API key
    api_key: ""
Extractors (scrapers/extractors/)
After a source returns paper records, extractors parse the text and pull structured parameter values.
TextExtractor (regex-based)
- Scans paper title, abstract, and full text using technology-specific regex patterns.
- Parameters it can extract: capex, opex_fixed, opex_var, efficiency, lifetime, co2_emissions, capacity, degradation_rate.
- Output: ExtractedValue with value, unit, context (surrounding sentence), confidence (0–1).
- Unit conversion: handles EUR → USD, magnitude words (million, billion, thousand), and common unit variants.
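A regex-based extractor of this kind could look roughly like the sketch below, shown for capex only. Everything here is an assumption for illustration: the pattern, the keyword-based confidence heuristic (0.9 / 0.5), the fixed character window used as context instead of the surrounding sentence, and the caller-supplied `eur_to_usd` rate are hypothetical, and the `ExtractedValue` dataclass only mirrors the four fields listed above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern: a number (with optional thousands separators), an
# optional magnitude word, then USD or EUR per kW. The trailing \b keeps
# "USD/kWh" from matching.
CAPEX_PATTERN = re.compile(
    r"(?P<value>\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)\s*"
    r"(?P<magnitude>thousand|million|billion)?\s*"
    r"(?P<currency>USD|EUR)\s*/\s*kW\b",
    re.IGNORECASE,
)
MAGNITUDES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}
COST_KEYWORDS = re.compile(r"capex|capital cost|investment cost", re.IGNORECASE)


@dataclass
class ExtractedValue:
    """Mirrors the documented fields; the real class may differ."""
    value: float
    unit: str
    context: str
    confidence: float


def extract_capex(text: str, eur_to_usd: float) -> Optional[ExtractedValue]:
    """Return the first capex figure in `text`, normalised to USD/kW."""
    match = CAPEX_PATTERN.search(text)
    if match is None:
        return None
    value = float(match.group("value").replace(",", ""))
    magnitude = match.group("magnitude")
    if magnitude:
        value *= MAGNITUDES[magnitude.lower()]
    if match.group("currency").upper() == "EUR":
        value *= eur_to_usd  # rate is supplied by the caller, not hard-coded
    # Context: a fixed 40-character window on each side of the match.
    start, end = max(match.start() - 40, 0), match.end() + 40
    context = text[start:end].strip()
    # Assumed heuristic: a nearby cost keyword raises confidence.
    confidence = 0.9 if COST_KEYWORDS.search(context) else 0.5
    return ExtractedValue(value, "USD/kW", context, confidence)


hit = extract_capex(
    "In Europe, capital costs of 870 USD/kW were reported for CCGT.",
    eur_to_usd=1.1,  # illustrative rate only
)
print(hit)
```

Returning None when nothing matches fits the rule that fields without extracted data are left out rather than filled with defaults.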
PDFExtractor (optional, requires pdfplumber)
- Downloads open-access PDFs from paper URLs.
- Extracts raw text for downstream processing by TextExtractor or LLMExtractor.
LLMExtractor (optional, requires OPENAI_API_KEY or ANTHROPIC_API_KEY)
- Sends paper abstract / full text to GPT or Claude with a structured extraction prompt.
- Returns LLMExtractedParams, a dict of parameter → value with per-field confidence scores.
- Extraction priority: LLM > Regex > omit field.
Configure the LLM model in scraper_config.yaml:
extraction:
  llm_enabled: false           # set true to enable
  llm_model: "gpt-4o-mini"     # or "claude-3-haiku-20240307"
  confidence_threshold: 0.6    # minimum confidence to accept an extracted value
Normalizer (scrapers/normalizer.py)
Converts raw extracted values into a flat catalogue-format instance dict:
- Merges LLM and regex outputs (LLM wins on conflict if confidence > threshold).
- Builds a deterministic instance_id slug.
- Fills only fields with extracted data; no defaults are invented.
- Attaches confidence, context, and source to each extracted field.
- Infers country_iso2 and country from paper text (regex + country name mapping).
Output is a proposed_instance dict that matches the flat catalogue schema and can be merged directly into data/<category>/<category>_technologies.json without post-processing.
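The merge rule and the deterministic ID can be sketched as follows. This is an illustration under assumptions, not the Normalizer's code: the per-field dict layout, the `extractor` key, treating the threshold as inclusive for both extractors, and the choice of fields that feed the slug are all hypothetical.

```python
import re


def merge_fields(regex_out: dict, llm_out: dict, threshold: float = 0.6) -> dict:
    """Merge per-field results following LLM > regex > omit.

    Each input maps a parameter name to {"value": ..., "confidence": ...}.
    Assumption: a value is accepted when confidence >= threshold.
    """
    merged = {}
    for name in sorted(set(regex_out) | set(llm_out)):
        llm, rx = llm_out.get(name), regex_out.get(name)
        if llm is not None and llm["confidence"] >= threshold:
            merged[name] = {**llm, "extractor": "llm"}
        elif rx is not None and rx["confidence"] >= threshold:
            merged[name] = {**rx, "extractor": "regex"}
        # Otherwise the field is omitted: no default is invented.
    return merged


def instance_id(technology_id: str, source: str, doi: str, year: int) -> str:
    """Build a slug that is identical for identical inputs (assumed recipe)."""
    raw = f"{technology_id}-{source}-{year}-{doi}".lower()
    return re.sub(r"[^a-z0-9]+", "_", raw).strip("_")


regex_out = {
    "capex": {"value": 900.0, "confidence": 0.7},
    "lifetime": {"value": 25, "confidence": 0.8},
    "efficiency": {"value": 0.4, "confidence": 0.3},
}
llm_out = {
    "capex": {"value": 870.0, "confidence": 0.91},
    "lifetime": {"value": 30, "confidence": 0.4},
}
merged = merge_fields(regex_out, llm_out)
slug = instance_id("ccgt", "open_alex", "10.1016/j.energy.2024.01.001", 2024)
print(merged)
print(slug)
```

In this example capex comes from the LLM, lifetime falls back to the regex value because the LLM result is below the threshold, and efficiency is dropped entirely.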
Storage (scrapers/storage.py)
Persists scraper candidates with deduplication.
Backend selection:
| Condition | Backend |
|---|---|
| SUPABASE_URL + SUPABASE_SERVICE_ROLE_KEY set | Supabase scraper_candidates + scraper_runs tables |
| Environment variables not set | Local files under data/scraped/candidates/ |
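The selection rule in the table can be expressed in a few lines. This is a sketch only: `select_backend` is a hypothetical helper, treating empty strings as "not set" is an assumption, and how the real Storage class combines this check with the output.backend setting in scraper_config.yaml is not covered here.

```python
import os
from typing import Mapping, Optional


def select_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "supabase" only when both Supabase variables are set and non-empty."""
    env = os.environ if env is None else env
    has_supabase = bool(env.get("SUPABASE_URL")) and bool(
        env.get("SUPABASE_SERVICE_ROLE_KEY")
    )
    return "supabase" if has_supabase else "filesystem"
```

Accepting the environment as a parameter makes the rule easy to check without touching the real process environment.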
Candidate statuses: pending → approved / rejected
Candidate schema (abbreviated):
{
  "candidate_id": "uuid4",
  "scraped_at": "2026-05-01T02:13:00Z",
  "status": "pending",
  "technology_id": "ccgt",
  "source": "open_alex",
  "paper_doi": "10.1016/j.energy.2024.01.001",
  "paper_title": "Cost analysis of combined cycle gas turbines in Europe",
  "paper_year": 2024,
  "paper_venue": "Energy",
  "extracted_params": {
    "capex_usd_per_kw": {
      "value": 870.0,
      "unit": "USD/kW",
      "context": "...capital costs of 870 USD/kW were reported...",
      "confidence": 0.91
    }
  },
  "proposed_instance": { ... }
}
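Deduplication and the pending → approved / rejected lifecycle could be modelled as in the sketch below. The deduplication key is not specified in this section, so keying on (technology_id, paper_doi) is an assumption, and `CandidateStore` is a hypothetical in-memory stand-in rather than the actual Storage API.

```python
ALLOWED_TRANSITIONS = {"pending": {"approved", "rejected"}}


def dedup_key(candidate: dict) -> tuple:
    # Assumption: one candidate per (technology, paper); DOIs compared
    # case-insensitively.
    return (candidate["technology_id"], candidate["paper_doi"].lower())


class CandidateStore:
    """In-memory stand-in for a storage backend (hypothetical API)."""

    def __init__(self) -> None:
        self._by_key: dict[tuple, dict] = {}

    def add(self, candidate: dict) -> bool:
        """Store a candidate as pending; return False for a duplicate."""
        key = dedup_key(candidate)
        if key in self._by_key:
            return False
        self._by_key[key] = {**candidate, "status": "pending"}
        return True

    def set_status(self, candidate: dict, new_status: str) -> None:
        """Move a stored candidate to a new status if the transition is allowed."""
        stored = self._by_key[dedup_key(candidate)]
        if new_status not in ALLOWED_TRANSITIONS.get(stored["status"], set()):
            raise ValueError(f"cannot move {stored['status']} -> {new_status}")
        stored["status"] = new_status

    def status_of(self, candidate: dict) -> str:
        return self._by_key[dedup_key(candidate)]["status"]
```

Encoding the allowed transitions in a table keeps approved and rejected terminal, so a reviewed candidate cannot be silently flipped back.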
Scheduler (scrapers/scheduler.py)
Wraps APScheduler with a SQLite job store (data/scraped/scheduler.db) so scheduled runs survive application restarts.
Default schedule (configurable in scraper_config.yaml):
schedule:
  enabled: true
  jobs:
    - id: "scrape_run_1"
      cron: "0 2 1 * *"    # 1st of month, 02:00 UTC
    - id: "scrape_run_2"
      cron: "0 2 15 * *"   # 15th of month, 02:00 UTC
The scheduler is started automatically on FastAPI application startup (lifespan context manager in main.py) and stopped on shutdown.
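To check what the two default cron expressions resolve to, the next run time can be computed with the standard library alone. This sketch deliberately does not use APScheduler; `next_run` is a hypothetical helper that hard-codes the default schedule (days 1 and 15, 02:00 UTC) instead of parsing cron strings.

```python
from datetime import datetime, timezone

RUN_DAYS = (1, 15)  # days of month from the two default cron entries
RUN_HOUR = 2        # 02:00 UTC


def next_run(now: datetime) -> datetime:
    """Return the first scheduled run strictly after `now`.

    `now` should be timezone-aware; it is converted to UTC before comparing.
    """
    now = now.astimezone(timezone.utc)
    year, month = now.year, now.month
    while True:
        for day in RUN_DAYS:
            candidate = datetime(year, month, day, RUN_HOUR, tzinfo=timezone.utc)
            if candidate > now:
                return candidate
        # No run left this month: roll over to the next one.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


print(next_run(datetime(2026, 5, 1, 2, 13, tzinfo=timezone.utc)))
```

A run that starts on the 1st at 02:00 UTC is therefore followed by the next one on the 15th, and a run on the 15th by the 1st of the following month.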
CLI
The scraper can also be invoked from the command line without starting the FastAPI server:
# Run the full pipeline
python -m scrapers.cli run
# Run only for specific technologies
python -m scrapers.cli run --tech ccgt --tech solar_pv_utility
# Run only specific sources
python -m scrapers.cli run --source open_alex --source nrel_atb
# List candidates by status
python -m scrapers.cli candidates --status pending
# Approve a candidate by ID
python -m scrapers.cli approve <candidate_id>
Admin API
The scraper pipeline is managed via the /scraper endpoints (admin JWT required). See API Reference for full details.
Adding support for a new source
- Create scrapers/sources/my_source.py inheriting from BaseScraper.
- Implement def search(self, technology_id: str, queries: list[str]) -> list[PaperRecord].
- Register it in the _SOURCE_CLASSES dict in scrapers/pipeline.py.
- Add a config block in scraper_config.yaml under sources:.
# scrapers/sources/my_source.py
from scrapers.base import BaseScraper, PaperRecord


class MySourceScraper(BaseScraper):
    source_name = "my_source"

    def search(self, technology_id: str, queries: list[str]) -> list[PaperRecord]:
        records = []
        for query in queries:
            # ... HTTP call, parse response ...
            records.append(PaperRecord(
                source_name="my_source",
                source_id="unique-id",
                title="...",
                year=2024,
                doi="10.xxxx/...",
                abstract="...",
            ))
        return records
Configuration reference
All scraper behaviour is controlled by scraper_config.yaml. Key sections:
http:
  timeout_seconds: 30
  max_retries: 3
  retry_backoff_seconds: 2.0
  rate_limit_delay: 1.5                  # seconds between API calls
  cache_enabled: true
  cache_ttl_hours: 24
  contact_email: opentech-db@th-deg.de   # for OpenAlex polite pool
extraction:
  llm_enabled: false
  llm_model: "gpt-4o-mini"
  confidence_threshold: 0.6
output:
  backend: "supabase"                    # or "filesystem"
  filesystem_base: "data/scraped"
Technology-specific search queries are configured under the technologies: key: