Maps an OpenAlex-like corpus (Arrow Dataset/Table or data.frame/tibble) to
CSL JSON items and writes them to chunked files. The function creates the
directory given by output (if it does not exist) and writes the files
chunk_1.json, chunk_2.json, ... inside that directory.
Arguments
- project_dir
Optional path to the project directory. If provided, it is used to set default
values for the corpus and output parameters. Can be omitted if corpus and
output are specified explicitly.
- corpus
Path to a parquet dataset, an Arrow Dataset/Table (e.g., from
arrow::open_dataset()), or a data.frame/tibble (e.g., from dplyr::collect()).
- output
Path to a directory to create and populate with chunked CSL JSON files
(chunk_1.json, chunk_2.json, ...).
- chunk_size
Rows processed per chunk via DuckDB. Default: 10000.
- overwrite
Overwrite output if it exists. Default: FALSE.
- verbose
Print progress messages. Default: TRUE.
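To make the interaction between chunk_size and output concrete, the following is a minimal sketch of how rows could be assigned to chunk files. The helper chunk_plan() is hypothetical and not part of the package. It assumes that chunks are filled in row order and that the last chunk may be smaller than chunk_size, which the documentation above does not state explicitly.

```r
# Hypothetical helper (not part of the package): lists the chunk files that
# would be produced for a corpus of n_rows rows, assuming row-order chunking.
chunk_plan <- function(n_rows, chunk_size = 10000, output = "csl_out") {
  n_chunks <- ceiling(n_rows / chunk_size)
  idx <- seq_len(n_chunks)
  data.frame(
    file = file.path(output, paste0("chunk_", idx, ".json")),
    first_row = (idx - 1) * chunk_size + 1,
    last_row = pmin(idx * chunk_size, n_rows),
    stringsAsFactors = FALSE
  )
}

# 25000 rows with the default chunk size give three files,
# the last one holding the remaining 5000 rows.
plan <- chunk_plan(25000, chunk_size = 10000, output = "csl_out")
print(plan)
```

Under these assumptions, a smaller chunk_size lowers peak memory per chunk at the cost of more output files.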
Details
This converter targets the most common OpenAlex field layout and is resilient
to missing columns by falling back to NULL/empty values in SQL. Mapping
includes: title, year, DOI, container-title (venue), volume/issue/pages,
authors (with basic given/family split and ORCID when present), URL/abstract,
publisher and ISSN, language, keywords (collapsed to a single string), and an
aggregated note with OA status and citation count. Records are processed in
DuckDB-backed chunks for low memory usage.
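The fallback behaviour for missing columns can be illustrated with a small in-memory sketch. This is not the package's implementation, which performs the mapping in SQL via DuckDB. It is a base R approximation covering only a few fields. The helper names (col_or_na, compact, to_csl) and the input column names (id, title, publication_year, doi, venue) are assumptions chosen for illustration, as is the fixed item type.

```r
# Return the column if present, otherwise a vector of NAs (the fallback).
col_or_na <- function(df, name) {
  if (name %in% names(df)) df[[name]] else rep(NA, nrow(df))
}

# Drop fields that are NULL or a single NA so they are omitted from the item.
compact <- function(x) {
  drop <- vapply(x, function(v) {
    is.null(v) || (is.atomic(v) && length(v) == 1 && is.na(v))
  }, logical(1))
  x[!drop]
}

# Hypothetical mapper: one CSL item (an R list) per row of df.
to_csl <- function(df) {
  id <- col_or_na(df, "id")
  title <- col_or_na(df, "title")
  year <- col_or_na(df, "publication_year")
  # OpenAlex stores DOIs as full URLs; CSL expects the bare identifier.
  doi <- sub("^https?://doi\\.org/", "", col_or_na(df, "doi"))
  venue <- col_or_na(df, "venue")
  lapply(seq_len(nrow(df)), function(i) {
    issued <- if (is.na(year[i])) NULL else
      list(`date-parts` = list(list(as.integer(year[i]))))
    compact(list(
      id = if (is.na(id[i])) as.character(i) else as.character(id[i]),
      type = "article-journal",  # assumption: fixed type for the sketch
      title = title[i],
      DOI = doi[i],
      `container-title` = venue[i],
      issued = issued
    ))
  })
}

# Example corpus without a venue column and with gaps in the second row.
works <- data.frame(
  id = c("W1", "W2"),
  title = c("A study", "Another study"),
  publication_year = c(2020L, NA),
  doi = c("https://doi.org/10.1000/xyz", NA),
  stringsAsFactors = FALSE
)
items <- to_csl(works)
str(items[[1]])
```

The resulting list can be serialized with, for example, jsonlite::toJSON(items, auto_unbox = TRUE). Fields whose source column is absent or NA are simply left out of the item.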