openalexPro 0.9.0

New Features

  • Rust backend via extendr. Core functions now delegate to a compiled Rust library (openalex-core v0.5.0) for JSON→Parquet conversion, schema inference, corpus indexing, and ID-based record lookup. Pure-R _R variants remain as fallbacks. This eliminates the external openalex-snapshot binary dependency for the main pipeline.

  • pro_rate_limit_status() — query your OpenAlex API rate-limit status (daily budget, used, remaining, prepaid balance, reset time, per-endpoint costs). Returns a list invisibly; prints a formatted summary when verbose = TRUE.

  • New debug option openalexPro.ratelimit_check: when set to TRUE via options(openalexPro.ratelimit_check = TRUE), every API call prints the current rate-limit status (budget, usage, remaining, reset time) as a message before the request is sent. Internally handled in api_call() using pro_rate_limit_status(verbose = TRUE). A recursion guard temporarily disables the option during the nested rate-limit request.
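The recursion guard described above can be sketched in base R roughly as follows. This is an illustration only: `api_call_sketch()` and `status_sketch()` are hypothetical stand-ins, not the package's internal `api_call()` or `pro_rate_limit_status()`, and no network request is made.

```r
# Hypothetical stand-ins: these are NOT the package's internal functions.
calls <- character(0)

status_sketch <- function() {
  # The status check is itself an API call, so it goes through the same
  # entry point as every other request.
  api_call_sketch("rate-limit")
}

api_call_sketch <- function(endpoint) {
  if (isTRUE(getOption("openalexPro.ratelimit_check", FALSE))) {
    # Recursion guard: switch the option off for the nested request and
    # restore the previous value when this function exits.
    old <- options(openalexPro.ratelimit_check = FALSE)
    on.exit(options(old), add = TRUE)
    status_sketch()
  }
  # Record the call instead of sending a real request.
  calls <<- c(calls, endpoint)
  invisible(endpoint)
}

options(openalexPro.ratelimit_check = TRUE)
api_call_sketch("works")
```

Without the guard, the nested status request would see the option still set and trigger another status request, and so on without end.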

openalexPro 0.8.1

Bug Fixes

  • pro_request() list method now respects the overwrite parameter. Previously, when query_url was a list, the top-level output directory was neither checked nor deleted regardless of overwrite. It now errors if the directory exists and overwrite = FALSE, and deletes it upfront if overwrite = TRUE.

  • pro_fetch() now deletes all three subdirectories (json, jsonl, parquet) upfront before the pipeline starts when overwrite = TRUE, rather than delegating deletion to each sub-function individually. If any of the subdirectories exist and overwrite = FALSE, the function now errors immediately with a clear message listing which directories already exist.
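The upfront check described for pro_fetch() might look roughly like the sketch below. `prepare_output_dirs()` is a hypothetical helper written for this illustration; the package's actual implementation and error wording may differ.

```r
# Hypothetical helper, not part of openalexPro: inspect all three output
# subdirectories before any pipeline stage starts.
prepare_output_dirs <- function(project_dir, overwrite = FALSE) {
  dirs <- file.path(project_dir, c("json", "jsonl", "parquet"))
  existing <- dirs[dir.exists(dirs)]
  if (length(existing) > 0) {
    if (!overwrite) {
      # Fail immediately and list every directory that is in the way.
      stop(
        "Output directories already exist: ",
        paste(existing, collapse = ", "),
        ". Use overwrite = TRUE to replace them.",
        call. = FALSE
      )
    }
    # overwrite = TRUE: delete everything upfront, in one place.
    unlink(existing, recursive = TRUE)
  }
  invisible(dirs)
}

# Usage on a throwaway directory
project <- file.path(tempdir(), "overwrite_sketch")
dir.create(file.path(project, "json"), recursive = TRUE, showWarnings = FALSE)
res <- tryCatch(
  prepare_output_dirs(project),
  error = function(e) conditionMessage(e)
)
prepare_output_dirs(project, overwrite = TRUE)
```

Checking all directories before starting avoids a half-finished run in which an early stage succeeds and a later stage refuses to overwrite its output.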

openalexPro 0.8.0

New Features

  • pro_request(), pro_request_jsonl(), and pro_request_jsonl_parquet() now accept nested lists of query URLs. Each nesting level is preserved as a subdirectory in the output, and the parquet stage converts directory depth into hive-style partition keys: depth 1 → query=<name>, depth 2 → query_l2=<name>, depth 3 → query_l3=<name>, etc. The resulting dataset is readable with arrow::open_dataset() and the partition columns appear as regular columns. pro_fetch() inherits this behaviour automatically. A new internal helper collect_leaf_queries() performs the recursive list flattening.
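To make the depth-to-partition mapping concrete, here is a minimal base-R sketch. `flatten_queries()` and `hive_path()` are hypothetical re-implementations for illustration; the internal collect_leaf_queries() may differ in signature and return value, and the URLs are only examples.

```r
# Hypothetical re-implementation for illustration only.
# Recursively walk a nested named list and collect each leaf URL together
# with the names of the lists it was nested in.
flatten_queries <- function(x, path = character(0)) {
  if (!is.list(x)) {
    return(list(list(path = path, url = x)))
  }
  out <- list()
  for (name in names(x)) {
    out <- c(out, flatten_queries(x[[name]], c(path, name)))
  }
  out
}

# Depth 1 -> query=<name>, depth 2 -> query_l2=<name>, and so on.
hive_path <- function(path) {
  depth <- seq_along(path)
  keys <- ifelse(depth == 1, "query", paste0("query_l", depth))
  paste(paste(keys, path, sep = "="), collapse = "/")
}

queries <- list(
  climate = list(
    y2023 = "https://api.openalex.org/works?filter=publication_year:2023",
    y2024 = "https://api.openalex.org/works?filter=publication_year:2024"
  ),
  health = "https://api.openalex.org/works?search=health"
)

leaves <- flatten_queries(queries)
partitions <- vapply(leaves, function(l) hive_path(l$path), character(1))
print(partitions)
```

Here `partitions` contains `query=climate/query_l2=y2023`, `query=climate/query_l2=y2024`, and `query=health`, which is the directory shape that arrow::open_dataset() can interpret as partition columns.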

openalexPro 0.7.0

Breaking Changes

  • snapshot_to_parquet() has a new signature. The old snapshot_dir and parquet_dir parameters are replaced by a single root_dir parameter that matches the directory layout used by the companion openalex-snapshot Rust binary. The function now delegates to the binary rather than performing conversion in R. Migration: replace snapshot_to_parquet(snapshot_dir = "...", parquet_dir = "...") with snapshot_to_parquet(root_dir = "...").

  • build_corpus_index() has a new signature. The old corpus_dir parameter is replaced by root_dir. The function now delegates to the openalex-snapshot binary. Migration: replace build_corpus_index(corpus_dir = "...") with build_corpus_index(root_dir = "...").

  • lookup_by_id() has a new signature. The old index_file and output parameters are replaced by root_dir and project_dir (consistent with the project-folder convention used by pro_request() and pro_fetch()). The function now delegates to the openalex-snapshot binary. Migration: replace lookup_by_id(index_file = "...", output = "...") with lookup_by_id(root_dir = "...", project_dir = "...").

New Features

  • Pure-R / DuckDB fallback variants are now exported as separate functions:

    • snapshot_to_parquet_R() — original R implementation of snapshot conversion (uses DuckDB + arrow, no external binary required)
    • build_corpus_index_R() — original R implementation of index building
    • lookup_by_id_R() — original R implementation of ID-based record lookup

    These retain the original parameter names and are useful when the openalex-snapshot binary is unavailable.

  • find_oas_binary() and run_oas() are exported internal helpers for resolving and invoking the openalex-snapshot binary. They support:

    1. Explicit oas_bin argument
    2. options(openalexPro.oas_bin = "/path/to/binary")
    3. PATH search via Sys.which("openalex-snapshot")
  • inst/Makefile.snapshot updated to use the openalex-snapshot binary directly (replacing Rscript invocations of the now-renamed R functions).
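The three-step lookup order for the binary listed above can be sketched as follows. `resolve_oas_binary()` is a hypothetical function for illustration; find_oas_binary() itself may use different argument handling, validation, and error messages.

```r
# Hypothetical sketch of the lookup order, not the package's code.
# It returns the first non-empty candidate and does not check that the
# file actually exists or is executable.
resolve_oas_binary <- function(oas_bin = NULL) {
  candidates <- c(
    oas_bin,                                 # 1. explicit argument
    getOption("openalexPro.oas_bin"),        # 2. package option
    unname(Sys.which("openalex-snapshot"))   # 3. PATH search ("" if absent)
  )
  candidates <- candidates[nzchar(candidates)]
  if (length(candidates) == 0) {
    stop("openalex-snapshot binary not found", call. = FALSE)
  }
  candidates[[1]]
}

# An explicit argument wins over the option.
options(openalexPro.oas_bin = "/path/from/option")
from_arg <- resolve_oas_binary("/path/from/argument")
from_opt <- resolve_oas_binary()
```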

Dependencies

openalexPro 0.6.1

Bug Fixes

  • Manually added the id field to opt_select_names(), as it is missing from the list returned by OpenAlex.

Changes

  • Normalized api_key handling across API-calling functions: pro_request(), pro_fetch(), pro_count(), and pro_download_content() now accept api_key = NULL or api_key = "". In that case, requests are sent without an API key (subject to OpenAlex’s unauthenticated limits).

  • Added explicit api_key type validation in API-calling functions. Accepted inputs are now limited to NULL or a length-1 character string.

  • Updated pro_rate_limit_status() to handle api_key = NULL safely (informational message + FALSE return), and aligned documentation.
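The api_key rules above (NULL or a length-1 character string, with "" treated as unauthenticated) could be enforced along these lines. `check_api_key()` is a hypothetical validator, not the package's internal check; rejecting NA is an assumption made for this sketch.

```r
# Hypothetical validator for illustration only.
# Returns NULL for unauthenticated mode, or the key unchanged.
check_api_key <- function(api_key) {
  if (is.null(api_key)) {
    return(NULL)
  }
  # Assumption: NA is rejected, although the changelog does not say so.
  if (!is.character(api_key) || length(api_key) != 1 || is.na(api_key)) {
    stop("`api_key` must be NULL or a length-1 character string",
         call. = FALSE)
  }
  if (!nzchar(api_key)) {
    return(NULL)  # "" is treated like NULL: send no key
  }
  api_key
}

bad <- tryCatch(check_api_key(123), error = function(e) e)
```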

Testing and Tooling

  • Added opt-in live API contract tests (tests/testthat/test-900-live_api_contracts.R) gated by OPENALEXPRO_LIVE_TESTS=true and a non-dummy openalexPro.apikey.

  • Added inst/scripts/record_cassettes.R and recording safeguards to prevent accidental re-recording with invalid credentials.

  • Reduced warning noise in test runs by cleaning up deprecated-search warning handling and removing unused cassette hooks.

openalexPro 0.6.0

New Features

  • Added pro_rate_limit_status() to query the OpenAlex rate-limit endpoint (GET /rate-limit). Returns the full rate-limit JSON invisibly (daily budget, used, remaining, prepaid balance, per-endpoint costs, reset time). Prints a human-readable summary via message() when verbose = TRUE (the default). Returns FALSE for a missing or invalid API key, and NULL on a network error, so callers can distinguish auth problems from transient failures.

  • pro_validate_credentials() refactored to use pro_rate_limit_status() internally instead of making a separate pro_count() request. Behaviour and return value are unchanged.

  • Added pro_download_content() to download full-text PDFs (format = "pdf") or TEI XML (format = "grobid-xml") from the OpenAlex content endpoint (content.openalex.org). Accepts a vector of work IDs, supports parallel downloads via workers, and returns a data frame with per-file status ("ok" / "not_found" / "error"). Note: content downloads cost $0.01 per file.

  • Added search.exact and search.semantic parameters to pro_query(), matching the new OpenAlex search API:

    • search.exact: searches without stemming or stop-word removal; supports boolean operators, quoted phrases, proximity (~N), and wildcards.
    • search.semantic: AI embedding-based search that matches by conceptual meaning rather than keywords (max 50 results, max 1 req/sec).
    • search: now documented to support the full boolean/phrase/wildcard syntax in addition to its existing stemmed matching.
  • Exported infer_json_schema() for direct use. Infers a unified DuckDB columns clause from a set of JSON/NDJSON files via per-file DESCRIBE queries with type-widening and optional two-level disk caching (schema_cache_dir).
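Because pro_rate_limit_status() returns a list on success, FALSE for a missing or invalid key, and NULL on a network error, a caller can branch on the three cases. The sketch below uses a hypothetical helper, `interpret_status()`, and takes the return value as an argument so that it runs without network access.

```r
# Hypothetical helper; `status` stands for the value returned by
# pro_rate_limit_status().
interpret_status <- function(status) {
  if (is.null(status)) {
    "network error: retry later"
  } else if (isFALSE(status)) {
    "missing or invalid API key"
  } else {
    "ok"
  }
}

# In real use one might write:
#   interpret_status(pro_rate_limit_status(verbose = FALSE))
interpret_status(NULL)
interpret_status(FALSE)
```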

Internal Changes

Deprecations

  • Filter arguments with a .search suffix (e.g. title_and_abstract.search = "...") are deprecated by the OpenAlex API. They still work but now emit a warning. Use the search parameter of pro_query() instead: pro_query(entity = "works", search = "your terms"). See https://developers.openalex.org/guides/searching for details.

Bug Fixes

  • Fixed Windows path-normalization failures in snapshot_to_parquet(), build_corpus_index(), lookup_by_id(), and pro_request_jsonl_parquet(). On Windows, normalizePath() can return 8.3 short names (e.g. RUNNER~1) for tempdir()-derived paths while list.files() and DuckDB resolve to long names (runneradmin). Resume detection in snapshot_to_parquet() used %in% on paths with mixed separators (\ vs /), causing already-converted files to be reconverted. build_corpus_index() embedded snapshot_dir (with \) inside a DuckDB regexp_replace pattern, which never matched — so the full absolute path was stored in the index and later doubled by lookup_by_id(). pro_request_jsonl_parquet() used normalizePath string comparison to detect subdirectories, which always failed, placing every output file in a spurious query=<dirname> subdirectory.

    Fixes: (1) normalize separators to / with gsub("\\\\", "/", ...) on both sides of %in% comparisons; (2) compute relative paths in R using path-depth counting (strsplit(path, "/") then indexed extraction) rather than string-matching absolute paths, which is immune to 8.3 vs long-name differences; (3) pass the relative path as a SQL literal in build_corpus_index() instead of computing it inside DuckDB with a regex.
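The first two fixes can be illustrated with a small base-R sketch. The helper names `to_slash()` and `tail_path()` and the example paths are hypothetical; the package's own code may be organised differently.

```r
# Normalise Windows separators so both sides of %in% compare equal.
to_slash <- function(p) gsub("\\\\", "/", p)

done <- "C:\\data\\works\\part_000.parquet"
todo <- c(
  "C:/data/works/part_000.parquet",
  "C:/data/works/part_001.parquet"
)
# Only files that are not yet converted remain.
pending <- todo[!to_slash(todo) %in% to_slash(done)]

# Relative path by depth counting: keep the last `n` path components
# instead of string-matching an absolute prefix, so it does not matter
# whether the prefix uses 8.3 short names or long names.
tail_path <- function(path, n) {
  parts <- strsplit(to_slash(path), "/", fixed = TRUE)[[1]]
  paste(utils::tail(parts, n), collapse = "/")
}

rel <- tail_path(
  "C:\\Users\\RUNNER~1\\snap\\works\\updated_date=2024-01-15\\part_001.gz",
  3
)
```

Here `pending` keeps only `part_001.parquet`, and `rel` is `works/updated_date=2024-01-15/part_001.gz` regardless of how the leading directories are spelled.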

Changes

  • Schema cache per-file CSVs renamed from %06d_<basename>.schema.csv to <update_date>_<part_name>.csv (e.g. 2024-01-15_part_001.csv), making each cache file directly traceable to its source .gz.

Breaking Changes

  • Removed mailto parameter from all API functions (pro_request(), pro_fetch(), pro_count(), pro_validate_credentials()). OpenAlex no longer uses email addresses for polite-pool access.
  • api_key handling was tightened in 0.6.0 for pro_request(), pro_fetch(), and pro_count().
    Note: this was later relaxed again in development; current development allows api_key = NULL / "" and runs in unauthenticated mode.
  • Simplified User-Agent string from openalexPro v[VERSION] (mailto:[EMAIL]) to openalexPro/[VERSION].

openalexPro 0.5.0

New Features

Snapshot Handling

  • Added prepare_snapshot() function for setting up a directory with Makefile and documentation for managing OpenAlex snapshots.
  • Added Makefile.snapshot in inst/ for automating snapshot download, conversion, and indexing. Includes targets for snapshot, parquet, parquet_index, and automatic renaming of existing data with release dates.
  • Added snapshot_to_parquet() function for converting OpenAlex snapshot NDJSON files to Parquet format using DuckDB. Processes each .gz file individually with per-file resume support. Supports parallel processing via workers (using future_lapply()) and unified schema inference via sample_size.
  • Added build_corpus_index() function for creating memory-efficient Parquet indexes for fast ID lookups. Handles 300M+ records by processing parquet files individually, with optional parallelization via workers and progress reporting via progressr. The index file is auto-named and placed alongside the corpus directory.
  • Added lookup_by_id() function for fast record retrieval from a parquet corpus using pre-built indexes. Uses Arrow for index filtering with automatic ID normalization. Supports parallel reads via workers and streaming to parquet via output for millions of IDs without loading into memory.
  • Added snapshot_filter_ids() function for filtering snapshot data by ID lists.
  • Added id_block() helper function for computing ID block partitions.

Documentation

  • Added snapshot.qmd vignette with comprehensive guide on downloading, converting, and querying OpenAlex snapshots locally.

Changes

  • Refactored snapshot_to_parquet() to process each .gz file individually instead of all at once. This reduces memory usage, enables per-file resume on interruption, and shows progress with ETA. The workers parameter now controls parallel future workers instead of DuckDB threads. Added sample_size parameter for schema inference.
  • Extracted infer_json_schema() and convert_json_to_parquet() internal helpers, shared by both snapshot_to_parquet() and pro_request_jsonl_parquet().
  • Refactored pro_request_jsonl_parquet() to per-file conversion with future_lapply() parallelization. Removes hive partitioning by page; subfolder structure is preserved directly. Added workers parameter. Removed progress parameter (replaced by progressr).

Bug Fixes

  • Fixed vignette parse errors in pro_query.qmd (malformed code block closings).
  • Fixed out-of-memory crash in snapshot_to_parquet() when sample_size exceeded the number of available files (e.g. sample_size = 10000 with 1981 works files). Schema inference now processes one file at a time instead of a single bulk DuckDB query.
  • Fixed duplicate key "as" crash when converting the works dataset. abstract_inverted_index is now stored as VARCHAR (raw JSON string) rather than a STRUCT. DuckDB folds struct field names to lowercase, causing a collision between the valid JSON keys "as" and "As" in this field. Storing as VARCHAR avoids struct parsing entirely and preserves the data. Parse individual values with jsonlite::fromJSON() when needed.
  • Fixed DuckDB temp file IO errors during snapshot_to_parquet() by exposing a TEMP_DIR variable in Makefile.snapshot (default /tmp).
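Since abstract_inverted_index is now stored as a raw JSON string, it can be parsed on demand with jsonlite::fromJSON(), as noted above. The sketch below uses a made-up four-word index that contains both "As" and "as"; rebuilding the abstract text from the positions is an illustration, not a function of the package.

```r
library(jsonlite)

# Made-up example value: each key is a word, each value its positions.
raw <- '{"As":[0],"we":[1],"know":[2],"as":[3]}'

# R list names are case-sensitive, so "As" and "as" stay distinct.
idx <- fromJSON(raw)

# Rebuild the text by ordering the words by position.
words <- rep(names(idx), lengths(idx))
positions <- unlist(idx, use.names = FALSE)
abstract <- paste(words[order(positions)], collapse = " ")
```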

Changes

  • snapshot_to_parquet() schema inference now runs one DuckDB DESCRIBE per file instead of a single query across all sampled files. Results are cached in <parquet_ds>/.schema_cache/: per-file CSVs (<update_date>_<part_name>.csv) enable mid-run resume; a unified unified_schema.csv is loaded on subsequent runs to skip inference entirely. Delete unified_schema.csv to force re-inference.

Tests

openalexPro 0.4.2

Breaking Changes

  • Removed load_sql_file(), which is no longer needed.

Documentation

  • Updated existing vignettes and added new ones.
  • Updated README.md.

Tests

  • Removed the dependency on openalexR in tests.

openalexPro 0.4.1

  • Standardised progress bar handling
  • Changed default pages from 1,000 to 10,000
  • Refactored pro_query() and removed the multiple_ids argument (with assistance from Claude); expanded tests and added a vignette.
  • Added creation of 00_completed in output directory of json, jsonl and parquet folders upon successful completion
  • Changed API key and email handling. Removed oap_mail() and oap_apikey(); the API key and email are now read only from the environment variables openalexPro.email and openalexPro.apikey
  • Added unified schema inference to pro_request_jsonl_parquet() to prevent schema conflicts when reading combined Parquet datasets. New sample_size parameter controls schema inference sampling. This fixes “Unsupported cast from string to struct” errors when fields have different types across JSONL files (e.g., apc_paid being null in some files and a struct in others).
  • Removed harmonize_parquet_schemata() as it is no longer needed with the new unified schema inference.
  • Increased the default number of pages read by request_json() from 1000 to 10000 to allow the initially planned download of 2,000,000 works.

openalexPro 0.4.0

  • CI and coverage tweaks for CRAN readiness.

  • Split snowball functionality out into openalexSnowball.

openalexPro 0.3.1

  • Added pro_fetch() with project_folder support for structured outputs.
  • Added progress reporting and parallelization for pro_request_jsonl().
  • Added sample_parquet_n() random sampling utilities with select support.
  • Improved count_only output to return a data frame with an error column.

openalexPro 0.3.0

  • Added count_only support for pro_request() and related helpers.
  • Added DOI handling improvements and API call fixes.

openalexPro 0.2.0

  • Introduced pro_query() as the package-native query builder with chunking.
  • Added snowball search utilities and citation edge extraction workflow.
  • Expanded conversion pipeline tests and VCR-based API fixtures.
  • Added extract_doi() helpers and compatibility reporting artifacts.