Command: index
Builds a parquet lookup index, equivalent to the R function build_corpus_index().
The default dataset is all, which builds an index for every dataset. Use --dataset <name> to limit the run to one dataset.
When running with --dataset all, datasets that already have an index file are skipped and only the missing indexes are built. If --index-file is supplied, it is ignored in this mode.
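The skip-or-build behaviour of all mode can be pictured with a small sketch. This is not the tool's implementation; the function name `split_by_existing_index` and the idea of passing a dataset-to-path mapping are assumptions made for illustration, and the real index file locations are not specified here.

```python
from pathlib import Path


def split_by_existing_index(index_paths):
    """Return (to_build, skipped) dataset names.

    index_paths maps a dataset name to the path where its index file is
    expected. The mapping itself is a hypothetical input, not the tool's
    actual on-disk layout.
    """
    to_build, skipped = [], []
    for dataset, path in index_paths.items():
        if Path(path).exists():
            skipped.append(dataset)   # index already present: leave it alone
        else:
            to_build.append(dataset)  # index missing: schedule a build
    return to_build, skipped
```

Under this reading, re-running all mode after a partial failure would only redo the datasets whose index file is still missing.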
Usage
openalex-snapshot index \
--root-dir /data \
--dataset works
Output columns
id | id_block | parquet_file | file_row_number
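As a rough illustration of how these columns might be used, the sketch below groups a set of requested ids by parquet_file so that each file only needs to be opened once. It works on plain dictionaries rather than a real parquet reader; the function name `plan_reads` and the sample rows are made up, and the meaning of id_block and whether file_row_number is zero-based are not stated here, so the sketch does not rely on either.

```python
from collections import defaultdict


def plan_reads(index_rows, wanted_ids):
    """Group requested ids by parquet file.

    index_rows: iterable of dicts with the index columns
                (id, id_block, parquet_file, file_row_number).
    Returns {parquet_file: [(file_row_number, id), ...]} with each list
    sorted by row position, so a file can be read in one forward pass.
    """
    wanted = set(wanted_ids)
    plan = defaultdict(list)
    for row in index_rows:
        if row["id"] in wanted:
            plan[row["parquet_file"]].append((row["file_row_number"], row["id"]))
    return {path: sorted(hits) for path, hits in plan.items()}


# Made-up rows, for illustration only.
sample_index = [
    {"id": "W1", "id_block": 0, "parquet_file": "part-0.parquet", "file_row_number": 7},
    {"id": "W2", "id_block": 0, "parquet_file": "part-1.parquet", "file_row_number": 2},
    {"id": "W3", "id_block": 1, "parquet_file": "part-0.parquet", "file_row_number": 3},
]
```

Calling `plan_reads(sample_index, ["W3", "W1"])` returns a single entry for `part-0.parquet` holding both hits in row order. Actually fetching the rows would require a parquet reader, which is left out of this sketch.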
Tuning
index reads parquet files (already-converted output of convert), so its memory needs are modest compared to convert. The default tuning is fine for any host with ≥4 GB RAM. Use --max-memory-mb <N> and --workers <N> only if you need to constrain resources.
Note: the legacy --profile safe|balanced|fast flag is accepted on this command for backwards compatibility but doesn't drive stratified parallelism. The stratified profile system (safe, stratified-36, custom stratified-N from openalex-snapshot.performance.yaml) currently applies only to convert and repair_convert. See convert.md.