Convert JSON files from pro_request() directly to Apache Parquet
Source: R/pro_request_parquet.R
Single-step replacement for the two-step
pro_request_jsonl_R() + pro_request_jsonl_parquet() pipeline.
Reads the JSON files written by pro_request() and converts each one to a
Parquet file using DuckDB, with no intermediate JSONL on disk.
Usage
pro_request_parquet(
  input_json = NULL,
  output = NULL,
  add_columns = list(),
  overwrite = FALSE,
  verbose = TRUE,
  progress = TRUE,
  delete_input = FALSE,
  sample_size = 1000,
  workers = NULL,
  enrich = TRUE,
  schema = "auto"
)
Arguments
- input_json
Directory of JSON files returned by pro_request().
- output
Output directory for the Parquet dataset.
- add_columns
Named list of scalar constant columns to embed in every output record (e.g.
list(query = "my_filter")). Values are embedded as SQL string literals; only character scalars are supported.
- overwrite
Logical. Overwrite output if it already exists. Default FALSE.
- verbose
Logical. Show progress messages. Default TRUE.
- progress
Logical. Show a progress bar. Default TRUE.
- delete_input
Logical. Delete input_json after a successful conversion. Default FALSE.
- sample_size
Integer. Number of records per file passed to DuckDB's sample_size option during schema inference. Use -1 to read all records (accurate but slow for large files). Default 1000.
- workers
Integer. Number of parallel workers. NULL or 1 runs sequentially. Default NULL.
- enrich
Logical. When TRUE (the default) and the inferred schema contains abstract_inverted_index, authorships, and publication_year, add the computed columns abstract and citation.
- schema
Controls use of a pre-built baseline schema for type resolution. Possible values:
- "auto" (default)
Auto-detect the OpenAlex entity type from the inferred columns, then load the matching schema from the user cache (populated by oa_cache_schema()) or the schemas bundled with the package. For each column where DuckDB runtime inference produced the ambiguous JSON fallback type, the baseline type is used instead. Falls back silently to runtime-only inference when the entity cannot be detected or no schema is found.
- "none" or NULL
Skip the baseline entirely; behaviour is identical to package versions before this feature was added.
- A file path
Path to a CSV with columns col_name and col_type. Used directly as the baseline.
- A directory path
Auto-detect the entity, then look for <entity>.csv inside that directory. Useful when pointing directly at a snapshot-metadata schemata directory.
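The type resolution that schema = "auto" performs can be pictured with a small base R sketch. This is only an illustration, assuming both the inferred schema and the baseline are held as data frames with col_name and col_type columns. The helper resolve_types() and the sample type strings are hypothetical and are not part of the package.

```r
# Hypothetical helper: replace ambiguous JSON types with baseline types.
resolve_types <- function(inferred, baseline) {
  # Position of each inferred column in the baseline (NA when absent).
  idx <- match(inferred$col_name, baseline$col_name)
  # Only columns inferred as JSON and present in the baseline are overridden.
  use_baseline <- inferred$col_type == "JSON" & !is.na(idx)
  inferred$col_type[use_baseline] <- baseline$col_type[idx[use_baseline]]
  inferred
}

# Illustrative schemas; the type strings are made up for this example.
inferred <- data.frame(
  col_name = c("id", "authorships", "extra"),
  col_type = c("VARCHAR", "JSON", "JSON"),
  stringsAsFactors = FALSE
)
baseline <- data.frame(
  col_name = c("id", "authorships"),
  col_type = c("VARCHAR", "STRUCT(author_position VARCHAR)[]"),
  stringsAsFactors = FALSE
)

resolved <- resolve_types(inferred, baseline)
```

In this sketch a column that stays JSON, such as extra, simply has no baseline entry, which mirrors the silent fallback to runtime-only inference described for "auto".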
Details
For works entities the function detects the presence of
abstract_inverted_index, authorships, and publication_year in the
inferred schema and, when enrich = TRUE (the default), adds two computed
columns:
- abstract: plain text reconstructed from abstract_inverted_index.
- citation: "Author (year)" / "A & B (year)" / "A et al. (year)".
These expressions are identical to those used by the openalex-snapshot CLI
binary, so the Parquet output matches the snapshot pipeline column for column.
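To make the two computed columns concrete, here is a minimal base R sketch of the same ideas. It is not the package's implementation, which runs as expressions inside DuckDB. It assumes the inverted index is a named list mapping each word to its 0-based positions, and that author names arrive as a character vector. The helpers rebuild_abstract() and make_citation() are hypothetical, as is the choice of which part of a name is used.

```r
# Rebuild plain text from an inverted index: a named list mapping each
# word to the integer positions (0-based) at which it occurs.
rebuild_abstract <- function(inverted_index) {
  if (length(inverted_index) == 0) return(NA_character_)
  # Repeat each word once per position, then sort words by position.
  words <- rep(names(inverted_index), lengths(inverted_index))
  positions <- unlist(inverted_index, use.names = FALSE)
  paste(words[order(positions)], collapse = " ")
}

# Build a short citation label from author names and a year.
make_citation <- function(authors, year) {
  n <- length(authors)
  if (n == 0) return(NA_character_)
  label <- if (n == 1) {
    authors[1]
  } else if (n == 2) {
    paste(authors[1], "&", authors[2])
  } else {
    paste(authors[1], "et al.")
  }
  sprintf("%s (%s)", label, year)
}

index <- list(the = c(0L, 3L), cat = 1L, saw = 2L, dog = 4L)
abstract_text <- rebuild_abstract(index)
one <- make_citation("Smith", 2020)
two <- make_citation(c("Smith", "Jones"), 2020)
many <- make_citation(c("Smith", "Jones", "Lee"), 2020)
```

How the package treats missing authors or a missing year is not stated here, so the NA handling above is an assumption.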
File format
pro_request() writes one JSON file per API page. For paginated queries
each file has the structure {"results": [...], "meta": {...}}. For
group-by queries the array field is "group_by". For single-record lookups
the file is a bare JSON object. All three formats are handled automatically.
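The three layouts can be told apart by their top-level keys. The sketch below shows that dispatch in base R, assuming a page has already been parsed into a nested list. The helper extract_records() is hypothetical; the function itself does this work inside DuckDB, and the exact logic it uses is not shown on this page.

```r
# Hypothetical helper: return the records held in one parsed page.
extract_records <- function(page) {
  keys <- names(page)
  if ("results" %in% keys) {
    page[["results"]]    # paginated query
  } else if ("group_by" %in% keys) {
    page[["group_by"]]   # group-by query
  } else {
    list(page)           # single-record lookup: the page is the record
  }
}

# Illustrative pages with made-up values.
paginated <- list(
  results = list(list(id = "W1"), list(id = "W2")),
  meta = list(count = 2)
)
grouped <- list(group_by = list(list(key = "2020", count = 5)))
single <- list(id = "W1", title = "Example")

n_paginated <- length(extract_records(paginated))
n_grouped <- length(extract_records(grouped))
single_id <- extract_records(single)[[1]]$id
```

Wrapping the bare object in a one-element list keeps the return shape the same for all three cases, so downstream code can always iterate over records.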
Output layout
The subdirectory structure of input_json is preserved, with hive-partition
naming (query=<name>/, query_l2=<name>/, …) so that Arrow/DuckDB can
read the result as a partitioned dataset. A page column is added to each
record with a value derived from the source filename (or subdirectory for
multi-query inputs).
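Hive-partition naming means that each key=value directory name becomes a column when Arrow or DuckDB reads the dataset. As a rough illustration of that convention, the base R sketch below pulls the partition pairs out of a relative path. The helper parse_hive_path() and the file name page_1.parquet are hypothetical; the actual output file names are not specified here.

```r
# Hypothetical helper: extract key=value partition pairs from a relative path.
parse_hive_path <- function(path) {
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  # Keep only the directory components that follow the key=value pattern.
  parts <- parts[grepl("=", parts, fixed = TRUE)]
  keys <- sub("=.*$", "", parts)
  values <- sub("^[^=]*=", "", parts)
  stats::setNames(as.list(values), keys)
}

partitions <- parse_hive_path("query=cats/query_l2=2020/page_1.parquet")
```

Here partitions holds query = "cats" and query_l2 = "2020", which is the information a partition-aware reader would expose as two extra columns.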
See also
pro_request() to download the JSON files,
pro_request_jsonl_R() and pro_request_jsonl_parquet() for the older
two-step pipeline (now deprecated).