Changelog
Source: NEWS.md
openalexPro 0.9.0
New Features
- Rust backend via extendr. Core functions now delegate to a compiled Rust library (openalex-core v0.5.0) for JSON→Parquet conversion, schema inference, corpus indexing, and ID-based record lookup. Pure-R _R variants remain as fallbacks. This eliminates the external openalex-snapshot binary dependency for the main pipeline.
- pro_rate_limit_status(): query your OpenAlex API rate-limit status (daily budget, used, remaining, prepaid balance, reset time, per-endpoint costs). Returns a list invisibly; prints a formatted summary when verbose = TRUE.
- New debug option openalexPro.ratelimit_check: when set to TRUE via options(openalexPro.ratelimit_check = TRUE), every API call prints the current rate-limit status (budget, usage, remaining, reset time) as a message before the request is sent. Internally handled in api_call() using pro_rate_limit_status(verbose = TRUE). A recursion guard temporarily disables the option during the nested rate-limit request.
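The recursion guard mentioned for the debug option can be illustrated with a small base-R sketch. It is not the package's api_call(): the functions fake_api_call() and fake_rate_limit_status() and the option name demo.ratelimit_check are hypothetical stand-ins, and no HTTP request is made.

```r
# Hypothetical stand-ins for illustration only; no network access.
status_calls <- 0L

fake_rate_limit_status <- function() {
  status_calls <<- status_calls + 1L
  # The status query itself travels through the same call path.
  fake_api_call("rate-limit")
}

fake_api_call <- function(endpoint) {
  if (isTRUE(getOption("demo.ratelimit_check", FALSE))) {
    # Guard: switch the option off for the nested request and
    # restore the previous value when this call exits.
    old <- options(demo.ratelimit_check = FALSE)
    on.exit(options(old), add = TRUE)
    fake_rate_limit_status()
  }
  paste("response from", endpoint)
}

options(demo.ratelimit_check = TRUE)
res <- fake_api_call("works")
option_after <- getOption("demo.ratelimit_check")
```

Without the guard, the nested status request would see the option still set and trigger another status request, recursing indefinitely. With it, the status function runs exactly once per outer call and the option is back to TRUE afterwards.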
openalexPro 0.8.1
Bug Fixes
- pro_request() list method now respects the overwrite parameter. Previously, when query_url was a list, the top-level output directory was neither checked nor deleted regardless of overwrite. It now errors if the directory exists and overwrite = FALSE, and deletes it upfront if overwrite = TRUE.
- pro_fetch() now deletes all three subdirectories (json, jsonl, parquet) upfront before the pipeline starts when overwrite = TRUE, rather than delegating deletion to each sub-function individually. If any of the subdirectories exist and overwrite = FALSE, the function now errors immediately with a clear message listing which directories already exist.
openalexPro 0.8.0
New Features
- pro_request(), pro_request_jsonl(), and pro_request_jsonl_parquet() now accept nested lists of query URLs. Each nesting level is preserved as a subdirectory in the output, and the parquet stage converts directory depth into hive-style partition keys: depth 1 → query=<name>, depth 2 → query_l2=<name>, depth 3 → query_l3=<name>, etc. The resulting dataset is readable with arrow::open_dataset() and the partition columns appear as regular columns. pro_fetch() inherits this behaviour automatically. A new internal helper collect_leaf_queries() performs the recursive list flattening.
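The mapping from nesting depth to partition keys can be sketched in base R. This is an illustration of the naming scheme described above, not the package's collect_leaf_queries(): flatten_queries() and partition_dir() are hypothetical helpers, and the example.org URLs are placeholders.

```r
# Recursively flatten a nested, named list of query URLs into leaves,
# remembering the chain of names that leads to each leaf.
flatten_queries <- function(x, path = character()) {
  if (!is.list(x)) {
    return(list(list(path = path, url = x)))
  }
  out <- list()
  for (nm in names(x)) {
    out <- c(out, flatten_queries(x[[nm]], c(path, nm)))
  }
  out
}

# Turn a name chain into hive-style directory components:
# depth 1 -> query=<name>, depth 2 -> query_l2=<name>, and so on.
partition_dir <- function(path) {
  depth <- seq_along(path)
  keys <- ifelse(depth == 1L, "query", paste0("query_l", depth))
  paste(paste0(keys, "=", path), collapse = "/")
}

queries <- list(
  climate = list(
    y2020 = "https://example.org/q1",
    y2021 = "https://example.org/q2"
  ),
  biodiversity = "https://example.org/q3"
)

leaves <- flatten_queries(queries)
dirs <- vapply(leaves, function(leaf) partition_dir(leaf$path), character(1))
```

Here dirs holds one relative directory per leaf query, for example query=climate/query_l2=y2020 for the first leaf and query=biodiversity for the last.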
openalexPro 0.7.0
Breaking Changes
- snapshot_to_parquet() has a new signature. The old snapshot_dir and parquet_dir parameters are replaced by a single root_dir parameter that matches the directory layout used by the companion openalex-snapshot Rust binary. The function now delegates to the binary rather than performing conversion in R. Migration: replace snapshot_to_parquet(snapshot_dir = "...", parquet_dir = "...") with snapshot_to_parquet(root_dir = "...").
- build_corpus_index() has a new signature. The old corpus_dir parameter is replaced by root_dir. The function now delegates to the openalex-snapshot binary. Migration: replace build_corpus_index(corpus_dir = "...") with build_corpus_index(root_dir = "...").
- lookup_by_id() has a new signature. The old index_file and output parameters are replaced by root_dir and project_dir (consistent with the project-folder convention used by pro_request() and pro_fetch()). The function now delegates to the openalex-snapshot binary. Migration: replace lookup_by_id(index_file = "...", output = "...") with lookup_by_id(root_dir = "...", project_dir = "...").
New Features
- Pure-R / DuckDB fallback variants are now exported as separate functions:
  - snapshot_to_parquet_R(): original R implementation of snapshot conversion (uses DuckDB + arrow, no external binary required)
  - build_corpus_index_R(): original R implementation of index building
  - lookup_by_id_R(): original R implementation of ID-based record lookup

  These retain the original parameter names and are useful when the openalex-snapshot binary is unavailable.
- find_oas_binary() and run_oas() are exported internal helpers for resolving and invoking the openalex-snapshot binary. They support:
  - Explicit oas_bin argument
  - options(openalexPro.oas_bin = "/path/to/binary")
  - PATH search via Sys.which("openalex-snapshot")
- inst/Makefile.snapshot updated to use the openalex-snapshot binary directly (replacing Rscript invocations of the now-renamed R functions).
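A lookup chain over the three sources listed above might look like the following sketch. It is not the code of find_oas_binary(): resolve_binary() is a hypothetical helper, and the precedence (explicit argument, then option, then PATH) is an assumption based on the order in which the sources are listed.

```r
# Hypothetical helper; assumed precedence: argument > option > PATH.
resolve_binary <- function(oas_bin = NULL,
                           option_name = "openalexPro.oas_bin",
                           program = "openalex-snapshot") {
  candidates <- c(
    oas_bin,                     # 1. explicit argument (NULL drops out)
    getOption(option_name),      # 2. R option (NULL if unset)
    unname(Sys.which(program))   # 3. PATH search ("" if not found)
  )
  candidates <- candidates[!is.na(candidates) & nzchar(candidates)]
  if (length(candidates) == 0L) {
    stop("Could not locate '", program, "'; pass oas_bin or set options(",
         option_name, " = ...)")
  }
  # This sketch returns the first candidate without checking that the file exists.
  candidates[[1]]
}

from_arg <- resolve_binary(oas_bin = "/opt/tools/openalex-snapshot")
```

An explicit argument always wins in this sketch, which makes it easy to pin a specific build in a script while leaving the option or PATH as the default for interactive use.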
Dependencies
- The openalex-snapshot Rust binary is now required for snapshot_to_parquet(), build_corpus_index(), and lookup_by_id(). Download from https://github.com/rkrug/openalex-snapshot/releases or build with cargo build --release. The pure-R *_R() variants have no binary dependency.
openalexPro 0.6.1
Bug Fixes
- Manually add the id field to opt_select_names(), as it is missing from the list returned by OpenAlex.
Changes
- Normalized api_key handling across API-calling functions: pro_request(), pro_fetch(), pro_count(), and pro_download_content() now accept api_key = NULL or api_key = "". In that case, requests are sent without an API key (subject to OpenAlex's unauthenticated limits).
- Added explicit api_key type validation in API-calling functions. Accepted inputs are now limited to NULL or a length-1 character string.
- Updated pro_rate_limit_status() to handle api_key = NULL safely (informational message + FALSE return), and aligned documentation.
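The accepted api_key inputs described above can be expressed as a small validation sketch. check_api_key() is a hypothetical helper, not a function of the package; treating NA as invalid and mapping "" to NULL are assumptions made for this illustration.

```r
# Hypothetical validator: returns NULL for "no key", the key otherwise,
# and errors on anything that is not NULL or a length-1 character string.
check_api_key <- function(api_key) {
  if (is.null(api_key)) {
    return(NULL)
  }
  if (!is.character(api_key) || length(api_key) != 1L || is.na(api_key)) {
    # Rejecting NA is an assumption of this sketch.
    stop("`api_key` must be NULL or a length-1 character string")
  }
  if (!nzchar(api_key)) {
    return(NULL)  # "" is treated like NULL: unauthenticated mode
  }
  api_key
}

key_null  <- check_api_key(NULL)
key_empty <- check_api_key("")
key_set   <- check_api_key("my-key")
```

Normalizing both NULL and "" to a single "no key" value keeps the downstream request code to one branch for unauthenticated mode.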
Testing and Tooling
- Added opt-in live API contract tests (tests/testthat/test-900-live_api_contracts.R) gated by OPENALEXPRO_LIVE_TESTS=true and a non-dummy openalexPro.apikey.
- Added inst/scripts/record_cassettes.R and recording safeguards to prevent accidental re-recording with invalid credentials.
- Reduced warning noise in test runs by cleaning up deprecated-search warning handling and removing unused cassette hooks.
openalexPro 0.6.0
New Features
- Added pro_rate_limit_status() to query the OpenAlex rate-limit endpoint (GET /rate-limit). Returns the full rate-limit JSON invisibly (daily budget, used, remaining, prepaid balance, per-endpoint costs, reset time). Prints a human-readable summary via message() when verbose = TRUE (the default). Returns FALSE for a missing or invalid API key, and NULL on a network error, so callers can distinguish auth problems from transient failures.
- pro_validate_credentials() refactored to use pro_rate_limit_status() internally instead of making a separate pro_count() request. Behaviour and return value are unchanged.
- Added pro_download_content() to download full-text PDFs (format = "pdf") or TEI XML (format = "grobid-xml") from the OpenAlex content endpoint (content.openalex.org). Accepts a vector of work IDs, supports parallel downloads via workers, and returns a data frame with per-file status ("ok" / "not_found" / "error"). Note: content downloads cost $0.01 per file.
- Added search.exact and search.semantic parameters to pro_query(), matching the new OpenAlex search API:
  - search.exact: searches without stemming or stop-word removal; supports boolean operators, quoted phrases, proximity (~N), and wildcards.
  - search.semantic: AI embedding-based search that matches by conceptual meaning rather than keywords (max 50 results, max 1 req/sec).
  - search: now documented to support the full boolean/phrase/wildcard syntax in addition to its existing stemmed matching.
- Exported infer_json_schema() for direct use. Infers a unified DuckDB columns clause from a set of JSON/NDJSON files via per-file DESCRIBE queries with type-widening and optional two-level disk caching (schema_cache_dir).
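The three-way return convention described for pro_rate_limit_status() (a list on success, FALSE for a credential problem, NULL for a network error) lets calling code branch as in the sketch below. describe_status() is a hypothetical helper, and the list passed in the example is a made-up stand-in rather than a real API response.

```r
# Hypothetical caller-side helper interpreting the documented return values.
describe_status <- function(status) {
  if (is.null(status)) {
    "network error"               # transient failure: retrying may help
  } else if (isFALSE(status)) {
    "missing or invalid API key"  # auth problem: retrying will not help
  } else {
    "ok"
  }
}

outcomes <- c(
  describe_status(NULL),
  describe_status(FALSE),
  describe_status(list(remaining = 10))  # made-up field name
)
```

The NULL check comes first because isFALSE(NULL) is FALSE and would otherwise fall through to the success branch.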
Internal Changes
- pro_rate_limit_status() and pro_download_content() now route their HTTP requests through the internal api_call() helper, unifying retry logic and error handling across all real API call sites. suppressMessages() is used to suppress api_call()'s internal logging so each function emits its own user-facing messages. pro_download_content() now also sends a User-Agent header (previously omitted).
Deprecations
- Filter arguments with a .search suffix (e.g. title_and_abstract.search = "...") are deprecated by the OpenAlex API. They still work but now emit a warning. Use the search parameter of pro_query() instead: pro_query(entity = "works", search = "your terms"). See https://developers.openalex.org/guides/searching for details.
Bug Fixes
- Fixed Windows path-normalization failures in snapshot_to_parquet(), build_corpus_index(), lookup_by_id(), and pro_request_jsonl_parquet(). On Windows, normalizePath() can return 8.3 short names (e.g. RUNNER~1) for tempdir()-derived paths while list.files() and DuckDB resolve to long names (runneradmin). Resume detection in snapshot_to_parquet() used %in% on paths with mixed separators (\ vs /), causing already-converted files to be reconverted. build_corpus_index() embedded snapshot_dir (with \) inside a DuckDB regexp_replace pattern, which never matched, so the full absolute path was stored in the index and later doubled by lookup_by_id(). pro_request_jsonl_parquet() used normalizePath string comparison to detect subdirectories, which always failed, placing every output file in a spurious query=<dirname> subdirectory.

  Fixes: (1) normalize separators to / with gsub("\\\\", "/", ...) on both sides of %in% comparisons; (2) compute relative paths in R using path-depth counting (strsplit(path, "/") then indexed extraction) rather than string-matching absolute paths, which is immune to 8.3 vs long-name differences; (3) pass the relative path as a SQL literal in build_corpus_index() instead of computing it inside DuckDB with a regex.
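The first two fixes can be sketched in base R. The helper names normalize_sep() and relative_tail() and the example paths are hypothetical; the point is only that comparing normalized separators and counting path components from the end does not depend on how the absolute prefix is spelled.

```r
# Fix (1): use forward slashes on both sides of a comparison.
normalize_sep <- function(p) gsub("\\\\", "/", p)

# Fix (2): keep the last `n_keep` path components instead of
# string-matching the absolute prefix.
relative_tail <- function(path, n_keep) {
  parts <- strsplit(normalize_sep(path), "/", fixed = TRUE)[[1]]
  paste(utils::tail(parts, n_keep), collapse = "/")
}

done <- "C:\\Users\\RUNNER~1\\snap\\works\\part_000.gz"
todo <- "C:/Users/RUNNER~1/snap/works/part_000.gz"

naive_match <- todo %in% done                                # FALSE
fixed_match <- normalize_sep(todo) %in% normalize_sep(done)  # TRUE

short_name <- "C:\\Users\\RUNNER~1\\snap\\works\\updated_date=2024-01-15\\part_000.gz"
long_name  <- "C:/Users/runneradmin/snap/works/updated_date=2024-01-15/part_000.gz"
rel_short <- relative_tail(short_name, 3)
rel_long  <- relative_tail(long_name, 3)
```

Both rel_short and rel_long come out as works/updated_date=2024-01-15/part_000.gz, even though the absolute prefixes differ in separator style and in short versus long directory names.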
Changes
- Schema cache per-file CSVs renamed from %06d_<basename>.schema.csv to <update_date>_<part_name>.csv (e.g. 2024-01-15_part_001.csv), making each cache file directly traceable to its source .gz.
Breaking Changes
- Removed mailto parameter from all API functions (pro_request(), pro_fetch(), pro_count(), pro_validate_credentials()). OpenAlex no longer uses email addresses for polite-pool access.
- api_key handling was tightened in 0.6.0 for pro_request(), pro_fetch(), and pro_count().
  Note: this was later relaxed again in development; current development allows api_key = NULL / "" and runs in unauthenticated mode.
- Simplified User-Agent string from openalexPro v[VERSION] (mailto:[EMAIL]) to openalexPro/[VERSION].
openalexPro 0.5.0
New Features
Snapshot Handling
- Added prepare_snapshot() function for setting up a directory with Makefile and documentation for managing OpenAlex snapshots.
- Added Makefile.snapshot in inst/ for automating snapshot download, conversion, and indexing. Includes targets for snapshot, parquet, parquet_index, and automatic renaming of existing data with release dates.
- Added snapshot_to_parquet() function for converting OpenAlex snapshot NDJSON files to Parquet format using DuckDB. Processes each .gz file individually with per-file resume support. Supports parallel processing via workers (using future_lapply()) and unified schema inference via sample_size.
- Added build_corpus_index() function for creating memory-efficient Parquet indexes for fast ID lookups. Handles 300M+ records by processing parquet files individually, with optional parallelization via workers and progress reporting via progressr. The index file is auto-named and placed alongside the corpus directory.
- Added lookup_by_id() function for fast record retrieval from a parquet corpus using pre-built indexes. Uses Arrow for index filtering with automatic ID normalization. Supports parallel reads via workers and streaming to parquet via output for millions of IDs without loading into memory.
- Added snapshot_filter_ids() function for filtering snapshot data by ID lists.
- Added id_block() helper function for computing ID block partitions.
Documentation
- Added snapshot.qmd vignette with comprehensive guide on downloading, converting, and querying OpenAlex snapshots locally.
Changes
- Refactored snapshot_to_parquet() to process each .gz file individually instead of all at once. This reduces memory usage, enables per-file resume on interruption, and shows progress with ETA. The workers parameter now controls parallel future workers instead of DuckDB threads. Added sample_size parameter for schema inference.
- Extracted infer_json_schema() and convert_json_to_parquet() internal helpers, shared by both snapshot_to_parquet() and pro_request_jsonl_parquet().
- Refactored pro_request_jsonl_parquet() to per-file conversion with future_lapply() parallelization. Removes hive partitioning by page; subfolder structure is preserved directly. Added workers parameter. Removed progress parameter (replaced by progressr).
Bug Fixes
- Fixed vignette parse errors in pro_query.qmd (malformed code block closings).
- Fixed out-of-memory crash in snapshot_to_parquet() when sample_size exceeded the number of available files (e.g. sample_size = 10000 with 1981 works files). Schema inference now processes one file at a time instead of a single bulk DuckDB query.
- Fixed duplicate key "as" crash when converting the works dataset. abstract_inverted_index is now stored as VARCHAR (raw JSON string) rather than a STRUCT. DuckDB folds struct field names to lowercase, causing a collision between the valid JSON keys "as" and "As" in this field. Storing as VARCHAR avoids struct parsing entirely and preserves the data. Parse individual values with jsonlite::fromJSON() when needed.
- Fixed DuckDB temp file IO errors during snapshot_to_parquet() by exposing a TEMP_DIR variable in Makefile.snapshot (default /tmp).
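The key collision behind the duplicate key "as" fix is easy to reproduce in isolation. The sketch below uses a made-up set of keys, not real OpenAlex data, and only demonstrates why case folding turns distinct JSON keys into duplicates.

```r
# Keys as they might appear in one abstract_inverted_index object
# (illustrative values only).
keys <- c("as", "As", "the", "The", "model")

# All keys are distinct when compared case-sensitively ...
distinct_as_is <- !any(duplicated(keys))

# ... but folding to lowercase produces duplicates.
folded <- tolower(keys)
collisions <- unique(folded[duplicated(folded)])
```

Here collisions contains "as" and "the". Because an inverted index of an English abstract will routinely hold both capitalized and lowercase forms of the same word, keeping the field as a raw JSON string sidesteps the problem for every record rather than only the ones that happened to crash.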
Changes
- snapshot_to_parquet() schema inference now runs one DuckDB DESCRIBE per file instead of a single query across all sampled files. Results are cached in <parquet_ds>/.schema_cache/: per-file CSVs (<update_date>_<part_name>.csv) enable mid-run resume; a unified unified_schema.csv is loaded on subsequent runs to skip inference entirely. Delete unified_schema.csv to force re-inference.
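How per-file results could be combined into one schema is sketched below in base R. This is not the package's infer_json_schema(): the helpers widen_type() and unify_schemas(), the "NULL" placeholder for columns that are entirely null in a file, and the rule of falling back to VARCHAR on conflicting types are all assumptions made for illustration.

```r
# Assumed widening rule: ignore "NULL" placeholders; if the remaining
# files agree on a type, use it; otherwise fall back to VARCHAR.
widen_type <- function(types) {
  known <- unique(types[types != "NULL"])
  if (length(known) == 1L) known else "VARCHAR"
}

# Each per-file schema is a named character vector: column name -> type.
unify_schemas <- function(schemas) {
  cols <- unique(unlist(lapply(schemas, names)))
  vapply(cols, function(col) {
    types <- unlist(lapply(schemas, function(s) {
      if (col %in% names(s)) s[[col]] else "NULL"
    }))
    widen_type(types)
  }, character(1))
}

file_a <- c(id = "VARCHAR", apc_paid = "NULL")
file_b <- c(id = "VARCHAR", apc_paid = "STRUCT(value BIGINT, currency VARCHAR)")
file_c <- c(id = "VARCHAR", cited_by_count = "BIGINT")

unified <- unify_schemas(list(file_a, file_b, file_c))
```

In this example apc_paid resolves to the struct type even though one file only ever saw nulls, and cited_by_count is included although it appears in a single file. Since each per-file result is small and independent, caching it to disk and recomputing only the union is cheap, which is what makes mid-run resume practical.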
Tests
- Added comprehensive tests for snapshot_to_parquet(), build_corpus_index(), and lookup_by_id().
- Added tests for schema caching, unified schema reuse, and works abstract_inverted_index VARCHAR round-trip.
openalexPro 0.4.1
- Standardised progress bar handling
- Changed default pages from 1,000 to 10,000
- Refactored pro_query and removed the multiple_ids argument (using Claude), expanded tests, and added a vignette.
- Added creation of 00_completed in the output directory of the json, jsonl and parquet folders upon successful completion
- Changed API key and email handling. Removed oap_mail() and oap_apikey() and simplified handling of API key and email to only use the environment variables openalexPro.email and openalexPro.apikey
- Added unified schema inference to pro_request_jsonl_parquet() to prevent schema conflicts when reading combined Parquet datasets. New sample_size parameter controls schema inference sampling. This fixes "Unsupported cast from string to struct" errors when fields have different types across JSONL files (e.g., apc_paid being null in some files and a struct in others).
- Removed harmonize_parquet_schemata() as it is no longer needed with the new unified schema inference.
- Increased the default number of pages to be read by request_json() from 1000 to 10000 to allow the initially planned 2,000,000 work download.
openalexPro 0.4.0
- CI and coverage tweaks for CRAN readiness.
- Split snowball functionality into openalexSnowball.
openalexPro 0.3.1
- Added pro_fetch() with project_folder support for structured outputs.
- Added progress reporting and parallelization for pro_request_jsonl().
- Added sample_parquet_n() random sampling utilities with select support.
- Improved count_only output to return a data frame with an error column.
openalexPro 0.3.0
- Added count_only support for pro_request() and related helpers.
- Added DOI handling improvements and API call fixes.
openalexPro 0.2.0
- Introduced pro_query() as the package-native query builder with chunking.
- Added snowball search utilities and citation edge extraction workflow.
- Expanded conversion pipeline tests and VCR-based API fixtures.
- Added extract_doi() helpers and compatibility reporting artifacts.