Databricks C++ SDK


A C++ SDK for Databricks, providing an interface for interacting with Databricks services.

Latest Release: v0.2.4

Author: Calvin Min (calvinjmin@gmail.com)


Table of Contents

  • Requirements
    • ODBC Driver Setup
    • Automated Setup Check
  • Installation
    • Option 1: CMake FetchContent (Recommended)
    • Option 2: vcpkg
    • Option 3: Manual Build and Install
  • Building from Source
  • Quick Start
  • Configuration
  • Running Examples
  • Performance Considerations
  • Advanced Usage
  • Documentation
  • License
  • Contributing
  • Support

Requirements

  • C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
  • CMake 3.14 or higher
  • ODBC Driver Manager:
    • Linux/macOS: unixODBC (brew install unixodbc or apt-get install unixodbc-dev)
    • Windows: Built-in ODBC Driver Manager
  • Simba Spark ODBC Driver: Download from Databricks

ODBC Driver Setup

After installing the requirements above, you need to configure the ODBC driver:

Linux/macOS

  1. Install unixODBC (if not already installed):
    # macOS
    brew install unixodbc
    # Ubuntu/Debian
    sudo apt-get install unixodbc unixodbc-dev
    # RedHat/CentOS
    sudo yum install unixODBC unixODBC-devel
  2. Download and install Simba Spark ODBC Driver from Databricks Downloads
  3. Verify driver installation:
    odbcinst -q -d
    You should see "Simba Spark ODBC Driver" in the output.
  4. If the driver is not found, check the ODBC configuration locations:
    odbcinst -j
    Ensure the driver is registered in the odbcinst.ini file shown (a sample entry follows these steps).
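
If the driver is missing from odbcinst.ini, you can register it manually. The entry below is only an illustrative sketch; the exact driver library path depends on your platform and installer version, so substitute the path where the Simba driver was actually installed:

[Simba Spark ODBC Driver]
Description=Databricks Simba Spark ODBC Driver
Driver=/opt/simba/spark/lib/64/libsparkodbc_sb64.so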

Windows

  1. Download and run the Simba Spark ODBC Driver installer from Databricks Downloads
  2. The installer will automatically register the driver with Windows ODBC Driver Manager

Using Alternative ODBC Drivers

If you prefer to use a different ODBC driver, you can configure it:

databricks::SQLConfig sql;
sql.odbc_driver_name = "Your Driver Name Here"; // Must match driver name from odbcinst -q -d
auto client = databricks::Client::Builder()
    .with_environment_config()
    .with_sql(sql)
    .build();

Automated Setup Check

Run the setup checker script to verify your ODBC configuration:

./scripts/check_odbc_setup.sh

This will verify:

  • unixODBC installation
  • ODBC configuration files
  • Installed ODBC drivers (including Simba Spark)
  • Library paths

Installation

Option 1: CMake FetchContent (Recommended - Direct from GitHub)

Add to your CMakeLists.txt:

include(FetchContent)
FetchContent_Declare(
    databricks_sdk
    GIT_REPOSITORY https://github.com/calvinjmin/databricks-sdk-cpp.git
    GIT_TAG main # or pin a specific release tag, e.g. 0.1.0
)
FetchContent_MakeAvailable(databricks_sdk)

target_link_libraries(my_app PRIVATE databricks_sdk::databricks_sdk)

Advantages: No separate installation step, always gets the exact version you specify.

Option 2: vcpkg

Once published to vcpkg (submission in progress), install with:

vcpkg install databricks-sdk-cpp

Then use in your CMake project:

find_package(databricks_sdk CONFIG REQUIRED)
target_link_libraries(my_app PRIVATE databricks_sdk::databricks_sdk)

For maintainers: See dev-docs/VCPKG_SUBMISSION.md for the complete submission guide.

Option 3: Manual Build and Install

# Clone and build
git clone https://github.com/calvinjmin/databricks-sdk-cpp.git
cd databricks-sdk-cpp
mkdir build && cd build
cmake ..
cmake --build .
# Install (requires sudo on Linux/macOS)
sudo cmake --install .

Then use in your project:

find_package(databricks_sdk REQUIRED)
target_link_libraries(my_app PRIVATE databricks_sdk::databricks_sdk)
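
For reference, a minimal consuming CMakeLists.txt for the installed package could look like this (the project and target names are placeholders, not part of the SDK):

cmake_minimum_required(VERSION 3.14)
project(my_app LANGUAGES CXX)

find_package(databricks_sdk REQUIRED)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE databricks_sdk::databricks_sdk)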

Building from Source

# Create build directory
mkdir build && cd build
# Configure
cmake ..
# Build
cmake --build .
# Install (optional)
sudo cmake --install .

Build Options

  • BUILD_EXAMPLES (default: ON) - Build example applications
  • BUILD_TESTS (default: OFF) - Build unit tests
  • BUILD_SHARED_LIBS (default: ON) - Build as shared library

Example:

cmake -DBUILD_EXAMPLES=ON -DBUILD_TESTS=ON ..

Quick Start

Configuration

The SDK uses a modular configuration system with separate concerns for authentication, SQL settings, and connection pooling. The Builder pattern provides a clean API for constructing clients.

Configuration Structure

The SDK separates configuration into four distinct concerns:

  • AuthConfig: Core authentication (host, token, timeout) - shared across all Databricks features
  • SQLConfig: SQL-specific settings (http_path, ODBC driver name)
  • PoolingConfig: Optional connection pooling settings (enabled, min/max connections)
  • RetryConfig: Optional automatic retry settings (enabled, max attempts, backoff strategy)

This modular design allows you to:

  • Share AuthConfig across different Databricks service clients (SQL, Workspace, Delta, etc.) - see the sketch after this list
  • Configure only what you need
  • Mix automatic and explicit configuration
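
For example, a single AuthConfig can back both the SQL client and the REST API clients described later in this README. A minimal sketch (host, token, and warehouse path are placeholders):

#include <databricks/client.h>
#include <databricks/jobs.h>

int main() {
    databricks::AuthConfig auth;
    auth.host = "https://my-workspace.databricks.com";
    auth.token = "dapi1234567890abcdef";

    databricks::SQLConfig sql;
    sql.http_path = "/sql/1.0/warehouses/abc123";

    // The same AuthConfig is reused by the SQL client and the Jobs client
    auto client = databricks::Client::Builder().with_auth(auth).with_sql(sql).build();
    databricks::Jobs jobs(auth);
    return 0;
}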

Option 1: Automatic Configuration (Recommended)

The SDK automatically loads configuration from ~/.databrickscfg or environment variables:

#include <databricks/client.h>

int main() {
    // Load from ~/.databrickscfg or environment variables
    auto client = databricks::Client::Builder()
        .with_environment_config()
        .build();

    auto results = client.query("SELECT * FROM my_table LIMIT 10");
    return 0;
}

Configuration Precedence (highest to lowest):

  1. Profile file (~/.databrickscfg with [DEFAULT] section) - if complete, used exclusively
  2. Environment variables (DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_HTTP_PATH) - only as fallback

Option 2: Profile File

Create ~/.databrickscfg:

[DEFAULT]
host = https://my-workspace.databricks.com
token = dapi1234567890abcdef
http_path = /sql/1.0/warehouses/abc123
# Alternative key name also supported:
# sql_http_path = /sql/1.0/warehouses/abc123
[production]
host = https://prod.databricks.com
token = dapi_prod_token
http_path = /sql/1.0/warehouses/prod123

Load specific profile:

auto client = databricks::Client::Builder()
    .with_environment_config("production")
    .build();

Option 3: Environment Variables Only

export DATABRICKS_HOST="https://my-workspace.databricks.com"
export DATABRICKS_TOKEN="dapi1234567890abcdef"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/abc123"
export DATABRICKS_TIMEOUT=120 # Optional
# Alternative variable names also supported:
# DATABRICKS_SERVER_HOSTNAME, DATABRICKS_ACCESS_TOKEN, DATABRICKS_SQL_HTTP_PATH

Option 4: Manual Configuration

#include <databricks/client.h>

int main() {
    // Configure authentication
    databricks::AuthConfig auth;
    auth.host = "https://my-workspace.databricks.com";
    auth.token = "dapi1234567890abcdef";
    auth.timeout_seconds = 60;

    // Configure SQL settings
    databricks::SQLConfig sql;
    sql.http_path = "/sql/1.0/warehouses/abc123";
    sql.odbc_driver_name = "Simba Spark ODBC Driver";

    // Build client
    auto client = databricks::Client::Builder()
        .with_auth(auth)
        .with_sql(sql)
        .build();

    // Execute a query
    auto results = client.query("SELECT * FROM my_table LIMIT 10");
    return 0;
}

Async Connection (Non-blocking)

#include <databricks/client.h>

int main() {
    // Build client without auto-connecting
    auto client = databricks::Client::Builder()
        .with_environment_config()
        .with_auto_connect(false)
        .build();

    // Start connection asynchronously
    auto connect_future = client.connect_async();

    // Do other work while connecting...

    // Wait for connection before querying
    connect_future.wait();
    auto results = client.query("SELECT current_timestamp()");
    return 0;
}

Connection Pooling (High Performance)

#include <databricks/client.h>

int main() {
    // Configure pooling
    databricks::PoolingConfig pooling;
    pooling.enabled = true;
    pooling.min_connections = 2;
    pooling.max_connections = 10;

    // Build client with pooling
    auto client = databricks::Client::Builder()
        .with_environment_config()
        .with_pooling(pooling)
        .build();

    // Query as usual - connections acquired/released automatically
    auto results = client.query("SELECT * FROM my_table");
    return 0;
}

Note: Multiple Clients with the same config automatically share the same pool!

Automatic Retry Logic (Reliability)

The SDK includes automatic retry logic with exponential backoff for transient failures:

#include <databricks/client.h>

int main() {
    // Configure retry behavior
    databricks::RetryConfig retry;
    retry.enabled = true;                  // Enable retries (default: true)
    retry.max_attempts = 5;                // Retry up to 5 times (default: 3)
    retry.initial_backoff_ms = 200;        // Start with 200ms delay (default: 100ms)
    retry.backoff_multiplier = 2.0;        // Double delay each retry (default: 2.0)
    retry.max_backoff_ms = 10000;          // Cap at 10 seconds (default: 10000ms)
    retry.retry_on_timeout = true;         // Retry timeout errors (default: true)
    retry.retry_on_connection_lost = true; // Retry connection errors (default: true)

    // Build client with retry configuration
    auto client = databricks::Client::Builder()
        .with_environment_config()
        .with_retry(retry)
        .build();

    // Queries automatically retry on transient errors
    auto results = client.query("SELECT * FROM my_table");
    return 0;
}

Retry Features:

  • Exponential backoff with jitter to prevent thundering herd (see the sketch after this list)
  • Intelligent error classification - only retries transient errors:
    • Connection timeouts and network errors
    • Server unavailability (503, 502, 504)
    • Rate limiting (429 Too Many Requests)
  • Non-retryable errors fail immediately:
    • Authentication failures
    • SQL syntax errors
    • Permission denied errors
  • Enabled by default with sensible defaults
  • Works with connection pooling for maximum reliability
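
The SDK handles this internally; the following is only a rough sketch of how the documented parameters combine into a retry schedule, not the SDK's implementation:

#include <algorithm>
#include <cstddef>
#include <random>

// Illustrative only: delay (ms) before retry attempt `attempt` (1-based),
// using RetryConfig-style parameters.
std::size_t backoff_ms(std::size_t attempt, std::size_t initial_ms,
                       double multiplier, std::size_t max_ms) {
    static std::mt19937 rng{std::random_device{}()};
    double delay = static_cast<double>(initial_ms);
    for (std::size_t i = 1; i < attempt; ++i) {
        delay *= multiplier;                                  // exponential growth per attempt
    }
    delay = std::min(delay, static_cast<double>(max_ms));     // cap at max_backoff_ms
    std::uniform_real_distribution<double> jitter(0.5, 1.0);
    return static_cast<std::size_t>(delay * jitter(rng));     // jitter spreads out retries
}

With the defaults (100ms initial delay, 2.0 multiplier, 10s cap), attempts back off at roughly 100ms, 200ms, 400ms, and so on before the jitter factor is applied.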

Disable Retries (if needed):

databricks::RetryConfig no_retry;
no_retry.enabled = false;
auto client = databricks::Client::Builder()
    .with_environment_config()
    .with_retry(no_retry)
    .build();

Mixing Configuration Approaches

The Builder pattern allows you to mix automatic and explicit configuration:

// Load auth from environment, but customize pooling
databricks::PoolingConfig pooling;
pooling.enabled = true;
pooling.max_connections = 20;

auto client = databricks::Client::Builder()
    .with_environment_config() // Load auth + SQL from environment
    .with_pooling(pooling)     // Override pooling settings
    .build();

Or load auth separately from SQL settings:

// Load auth from profile, SQL from environment
auto auth = databricks::AuthConfig::from_profile("DEFAULT");
databricks::SQLConfig sql;
sql.http_path = std::getenv("CUSTOM_HTTP_PATH");

auto client = databricks::Client::Builder()
    .with_auth(auth)
    .with_sql(sql)
    .build();

Accessing Configuration

You can access the modular configuration from any client:

auto client = databricks::Client::Builder()
    .with_environment_config()
    .build();

// Access configuration
const auto& auth = client.get_auth_config();
const auto& sql = client.get_sql_config();
const auto& pooling = client.get_pooling_config();

std::cout << "Connected to: " << auth.host << std::endl;
std::cout << "Using warehouse: " << sql.http_path << std::endl;

For a complete example, see examples/simple_query.cpp.

Running Examples

Setup Configuration

Examples automatically load configuration from either:

Option A: Profile File (recommended for development)

Create ~/.databrickscfg:

[DEFAULT]
host = https://your-workspace.databricks.com
token = your_databricks_token
http_path = /sql/1.0/warehouses/your_warehouse_id
# or: sql_http_path = /sql/1.0/warehouses/your_warehouse_id

Option B: Environment Variables (recommended for CI/CD)

export DATABRICKS_HOST="https://your-workspace.databricks.com"
export DATABRICKS_TOKEN="your_databricks_token"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/your_warehouse_id"

Or source a .env file:

set -a; source .env; set +a

Note: Profile configuration takes priority. Environment variables are used only as a fallback when no profile is configured.

Run Examples

After building with BUILD_EXAMPLES=ON, the following examples are available:

# SQL query execution with parameterized queries
./build/examples/simple_query
# Jobs API - list jobs, get details, trigger runs
./build/examples/jobs_example
# Compute API - manage clusters, create/start/stop/terminate
./build/examples/compute_example

Each example demonstrates a different aspect of the SDK:

  • simple_query: Basic SQL execution and parameterized queries
  • jobs_example: Jobs API for workflow automation
  • compute_example: Compute/Clusters API for cluster management

Performance Considerations

Connection Pooling Benefits

Connection pooling eliminates the overhead of creating new ODBC connections for each query:

  • Without pooling: 500-2000ms per query (includes connection time)
  • With pooling: 1-50ms per query (connection reused)
  • Recommended: Use pooling for applications making multiple queries

Async Operations Benefits

Async operations reduce perceived latency by performing work in the background:

  • Async connect: Start connecting while doing other initialization
  • Async query: Execute multiple queries concurrently (see the sketch after this list)
  • Combined with pooling: Maximum throughput for concurrent workloads
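
One way to run queries concurrently is to fan out tasks with std::async, giving each task its own Client. The sketch below assumes that pattern (the table names are hypothetical); since clients built with identical configs share a single pool, as noted above, the extra clients stay cheap:

#include <databricks/client.h>
#include <future>
#include <string>
#include <vector>

int main() {
    databricks::PoolingConfig pooling;
    pooling.enabled = true; // identical configs share one pool across clients

    std::vector<std::future<std::vector<std::vector<std::string>>>> futures;
    for (std::string table : {"sales", "users"}) { // hypothetical table names
        futures.push_back(std::async(std::launch::async, [table, pooling] {
            // Each task builds its own client from the shared configuration
            auto client = databricks::Client::Builder()
                .with_environment_config()
                .with_pooling(pooling)
                .build();
            return client.query("SELECT COUNT(*) FROM " + table);
        }));
    }
    for (auto& f : futures) {
        auto rows = f.get(); // blocks until that query completes
    }
    return 0;
}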

Best Practices

  1. Enable pooling via PoolingConfig for applications making multiple queries
  2. Use async operations when you can do other work while waiting
  3. Enable retry logic (on by default) for production reliability against transient failures
  4. Combine pooling + retries for maximum reliability and performance
  5. Size pools appropriately: min = typical concurrent load, max = peak load
  6. Share configs: Clients with identical configs automatically share pools
  7. Tune retry settings based on your workload:
    • High-throughput: Lower max_attempts (2-3) to fail fast
    • Critical operations: Higher max_attempts (5-7) for maximum reliability
    • Rate-limited APIs: Increase initial_backoff_ms and max_backoff_ms

Advanced Usage

Jobs API

Interact with Databricks Jobs to automate and orchestrate data workflows:

#include <databricks/jobs.h>
#include <databricks/config.h>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main() {
    // Load auth configuration
    auto auth = databricks::AuthConfig::from_environment();

    // Create Jobs API client
    databricks::Jobs jobs(auth);

    // List all jobs
    auto job_list = jobs.list_jobs(25, 0);
    for (const auto& job : job_list) {
        std::cout << "Job: " << job.name
                  << " (ID: " << job.job_id << ")" << std::endl;
    }

    // Get specific job details
    auto job = jobs.get_job(123456789);
    std::cout << "Created by: " << job.creator_user_name << std::endl;

    // Trigger a job run with parameters
    std::map<std::string, std::string> params;
    params["date"] = "2024-01-01";
    params["environment"] = "production";

    uint64_t run_id = jobs.run_now(123456789, params);
    std::cout << "Started run: " << run_id << std::endl;
    return 0;
}

Key Features:

  • List jobs: Paginated listing with limit/offset support
  • Get job details: Retrieve full job configuration and metadata
  • Trigger runs: Start jobs with optional notebook parameters
  • Type-safe IDs: Uses uint64_t to correctly handle large job IDs
  • JSON parsing: Built on nlohmann/json for reliable parsing

API Compatibility:

  • Uses Jobs API 2.2 for full feature support including pagination
  • Timestamps returned as Unix milliseconds (uint64_t) - see the conversion sketch after this list
  • Automatic error handling with descriptive messages
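
Because run and job timestamps are plain Unix-millisecond values, converting them for display needs only the standard library. A small sketch (the helper name is hypothetical, not part of the SDK):

#include <cstdint>
#include <ctime>
#include <iostream>

// Hypothetical helper: print a Unix-millisecond timestamp as UTC text
void print_utc(uint64_t unix_ms) {
    std::time_t secs = static_cast<std::time_t>(unix_ms / 1000);
    std::cout << std::asctime(std::gmtime(&secs)); // e.g. "Mon Jan  1 00:00:00 2024"
}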

For a complete example, see examples/jobs_example.cpp.

Compute/Clusters API

Manage Databricks compute clusters programmatically:

#include <databricks/compute.h>
#include <databricks/config.h>
#include <iostream>

int main() {
    auto auth = databricks::AuthConfig::from_environment();
    databricks::Compute compute(auth);

    // List clusters
    auto clusters = compute.list_compute();
    for (const auto& c : clusters) {
        std::cout << c.cluster_name << " [" << c.state << "]" << std::endl;
    }

    // Lifecycle management
    compute.start_compute("cluster-id");
    compute.restart_compute("cluster-id");
    compute.terminate_compute("cluster-id");
    return 0;
}

Features:

  • List/get cluster details
  • Start, restart, and terminate clusters
  • Cluster state tracking (PENDING, RUNNING, TERMINATED, etc.)
  • Automatic HTTP retry logic with exponential backoff

HTTP Retry Logic:

All REST API calls automatically retry on transient failures (408, 429, 500-504) with exponential backoff (1s, 2s, 4s). This is built into the HTTP client and requires no configuration.

Direct ConnectionPool Management

For advanced users who need fine-grained control over connection pools:

// Build config for pool
databricks::AuthConfig auth;
auth.host = "https://my-workspace.databricks.com";
auth.token = "dapi1234567890abcdef";

databricks::SQLConfig sql;
sql.http_path = "/sql/1.0/warehouses/abc123";

// Create and manage pool explicitly
databricks::ConnectionPool pool(auth, sql, 2, 10);
pool.warm_up();

// Acquire connections manually
{
    auto pooled_conn = pool.acquire();
    auto results = pooled_conn->query("SELECT...");
} // Connection returns to pool

// Monitor pool
auto stats = pool.get_stats();
std::cout << "Available: " << stats.available_connections << std::endl;

Note: Most users should use the Builder with PoolingConfig instead of direct pool management.

Documentation

The SDK includes comprehensive API documentation generated from code comments using Doxygen.

📚 View Online Documentation

Live Documentation: https://calvinjmin.github.io/databricks-sdk-cpp/

The documentation is automatically built and published via GitHub Actions whenever changes are pushed to the main branch.

Generate Documentation Locally

# Install Doxygen
brew install doxygen # macOS
# or: sudo apt-get install doxygen # Linux
# Generate docs (creates docs/html/)
doxygen Doxyfile
# View in browser
open docs/html/index.html # macOS
# or: xdg-open docs/html/index.html # Linux

Documentation Features

The generated documentation includes:

  • Complete API Reference: All public classes, methods, and structs with detailed descriptions
  • README Integration: Full README displayed as the main landing page
  • Code Examples: Inline examples from header comments
  • Jobs API Documentation: Full reference for databricks::Jobs, Job, and JobRun types
  • SQL Client Documentation: Complete databricks::Client API reference
  • Connection Pooling: databricks::ConnectionPool and configuration types
  • Source Browser: Browse source code with syntax highlighting
  • Search Functionality: Quick search across all documentation
  • Cross-references: Navigate between related classes and methods

Quick Links (After Generation)

  • Main Page: docs/html/index.html - README and getting started
  • Classes: docs/html/annotated.html - All classes and structs
  • Jobs API: docs/html/classdatabricks_1_1_jobs.html - Jobs API reference
  • Client API: docs/html/classdatabricks_1_1_client.html - SQL client reference
  • Files: docs/html/files.html - Browse by file

Example: Viewing Jobs API Docs

# Generate and open Jobs API documentation
doxygen Doxyfile
open docs/html/classdatabricks_1_1_jobs.html

The documentation is automatically generated from the inline comments in header files, ensuring it stays synchronized with the code.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

For issues and questions, please open an issue on the GitHub repository.