
A C++ SDK for Databricks, providing an interface for interacting with Databricks services.
Latest Release: v0.2.4
Author: Calvin Min (calvinjmin@gmail.com)
Table of Contents
- Requirements
- ODBC Driver Setup
- Automated Setup Check
- Installation
- Option 1: CMake FetchContent (Recommended)
- Option 2: vcpkg
- Option 3: Manual Build and Install
- Building from Source
- Quick Start
- Configuration
- Running Examples
- Performance Considerations
- Advanced Usage
- Documentation
- License
- Contributing
Requirements
- C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
- CMake 3.14 or higher
- ODBC Driver Manager:
- Linux/macOS: unixODBC (brew install unixodbc or apt-get install unixodbc-dev)
- Windows: Built-in ODBC Driver Manager
- Simba Spark ODBC Driver: Download from Databricks
ODBC Driver Setup
After installing the requirements above, you need to configure the ODBC driver:
Linux/macOS
- Install unixODBC (if not already installed):
# macOS
brew install unixodbc
# Ubuntu/Debian
sudo apt-get install unixodbc unixodbc-dev
# RedHat/CentOS
sudo yum install unixODBC unixODBC-devel
- Download and install Simba Spark ODBC Driver from Databricks Downloads
- Verify driver installation by listing the registered ODBC drivers with odbcinst -q -d. You should see "Simba Spark ODBC Driver" in the output.
- If the driver is not found, check the ODBC configuration locations with odbcinst -j and ensure the driver is registered in the odbcinst.ini file it reports.
Windows
- Download and run the Simba Spark ODBC Driver installer from Databricks Downloads
- The installer will automatically register the driver with Windows ODBC Driver Manager
Using Alternative ODBC Drivers
If you prefer to use a different ODBC driver, you can configure it:
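For example, a minimal sketch: override SQLConfig::odbc_driver_name (default: "Simba Spark ODBC Driver") and attach it with the Builder's with_sql() method. The driver name shown is a placeholder for whatever name your driver is registered under in odbcinst.ini:
#include <databricks/client.h>

int main() {
    databricks::SQLConfig sql;
    sql.http_path = "/sql/1.0/warehouses/abc123";
    sql.odbc_driver_name = "My Alternative ODBC Driver";  // placeholder: use the registered driver name

    auto client = databricks::Client::Builder()
        .with_environment_config()  // host/token from profile or env vars
        .with_sql(sql)              // SQL settings, including the driver override
        .build();
    return 0;
}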
Automated Setup Check
Run the setup checker script to verify your ODBC configuration:
./scripts/check_odbc_setup.sh
This will verify:
- unixODBC installation
- ODBC configuration files
- Installed ODBC drivers (including Simba Spark)
- Library paths
Installation
Option 1: CMake FetchContent (Recommended - Direct from GitHub)
Add to your CMakeLists.txt:
include(FetchContent)
FetchContent_Declare(
databricks_sdk
GIT_REPOSITORY https://github.com/calvinjmin/databricks-sdk-cpp.git
GIT_TAG main # or pin a specific version tag like 0.1.0
)
FetchContent_MakeAvailable(databricks_sdk)
target_link_libraries(my_app PRIVATE databricks_sdk::databricks_sdk)
Advantages: No separate installation step, always gets the exact version you specify.
Option 2: vcpkg
Once published to vcpkg (submission in progress), install with:
vcpkg install databricks-sdk-cpp
Then use in your CMake project:
find_package(databricks_sdk CONFIG REQUIRED)
target_link_libraries(my_app PRIVATE databricks_sdk::databricks_sdk)
For maintainers: See dev-docs/VCPKG_SUBMISSION.md for the complete submission guide.
Option 3: Manual Build and Install
# Clone and build
git clone https://github.com/calvinjmin/databricks-sdk-cpp.git
cd databricks-sdk-cpp
mkdir build && cd build
cmake ..
cmake --build .
# Install (requires sudo on Linux/macOS)
sudo cmake --install .
Then use in your project:
find_package(databricks_sdk REQUIRED)
target_link_libraries(my_app PRIVATE databricks_sdk::databricks_sdk)
Building from Source
# Create build directory
mkdir build && cd build
# Configure
cmake ..
# Build
cmake --build .
# Install (optional)
sudo cmake --install .
Build Options
BUILD_EXAMPLES (default: ON) - Build example applications
BUILD_TESTS (default: OFF) - Build unit tests
BUILD_SHARED_LIBS (default: ON) - Build as shared library
Example:
cmake -DBUILD_EXAMPLES=ON -DBUILD_TESTS=ON ..
Quick Start
Configuration
The SDK uses a modular configuration system with separate concerns for authentication, SQL settings, and connection pooling. The Builder pattern provides a clean API for constructing clients.
Configuration Structure
The SDK separates configuration into four distinct concerns:
- AuthConfig: Core authentication (host, token, timeout) - shared across all Databricks features
- SQLConfig: SQL-specific settings (http_path, ODBC driver name)
- PoolingConfig: Optional connection pooling settings (enabled, min/max connections)
- RetryConfig: Optional automatic retry settings (enabled, max attempts, backoff strategy)
This modular design allows you to:
- Share AuthConfig across different Databricks service clients (SQL, Workspace, Delta, etc.)
- Configure only what you need
- Mix automatic and explicit configuration
Option 1: Automatic Configuration (Recommended)
The SDK automatically loads configuration from ~/.databrickscfg or environment variables:
#include <databricks/client.h>

int main() {
    // Configuration is loaded automatically from ~/.databrickscfg or environment variables
    auto client = databricks::Client::Builder()
        .with_environment_config()
        .build();

    // query() returns rows as std::vector<std::vector<std::string>>;
    // an optional std::vector<Parameter> argument supports parameterized queries
    auto results = client.query("SELECT * FROM my_table LIMIT 10");
    return 0;
}
Configuration Precedence (highest to lowest):
- Profile file (~/.databrickscfg with [DEFAULT] section) - if complete, used exclusively
- Environment variables (DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_HTTP_PATH) - only as fallback
Option 2: Profile File
Create ~/.databrickscfg:
[DEFAULT]
host = https://my-workspace.databricks.com
token = dapi1234567890abcdef
http_path = /sql/1.0/warehouses/abc123
# Alternative key name also supported:
# sql_http_path = /sql/1.0/warehouses/abc123
[production]
host = https://prod.databricks.com
token = dapi_prod_token
http_path = /sql/1.0/warehouses/prod123
Load specific profile:
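For example, a short sketch using the profile overload of with_environment_config():
#include <databricks/client.h>

int main() {
    // Load the [production] profile from ~/.databrickscfg
    auto client = databricks::Client::Builder()
        .with_environment_config("production")
        .build();

    auto results = client.query("SELECT 1");
    return 0;
}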
Option 3: Environment Variables Only
export DATABRICKS_HOST="https://my-workspace.databricks.com"
export DATABRICKS_TOKEN="dapi1234567890abcdef"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/abc123"
export DATABRICKS_TIMEOUT=120 # Optional
# Alternative variable names also supported:
# DATABRICKS_SERVER_HOSTNAME, DATABRICKS_ACCESS_TOKEN, DATABRICKS_SQL_HTTP_PATH
Option 4: Manual Configuration
#include <databricks/client.h>
int main() {
auth.
host =
"https://my-workspace.databricks.com";
auth.
token =
"dapi1234567890abcdef";
sql.
http_path =
"/sql/1.0/warehouses/abc123";
auto results = client.
query(
"SELECT * FROM my_table LIMIT 10");
return 0;
}
The Builder's with_auth() and with_sql() methods attach these configurations to the client.
AuthConfig (core authentication shared across all Databricks features):
- host - Databricks workspace URL (e.g., "https://your-workspace.cloud.databricks.com")
- token - Authentication token (personal access token or OAuth token)
- timeout_seconds - Request timeout in seconds (default: 60)
SQLConfig:
- http_path - HTTP path for SQL warehouse/cluster (e.g., "/sql/1.0/warehouses/abc123")
Async Connection (Non-blocking)
#include <databricks/client.h>

int main() {
    auto client = databricks::Client::Builder()
        .with_environment_config()
        .with_auto_connect(false)  // don't connect immediately on build()
        .build();

    // connect_async() returns a std::future<void> and connects in the background
    auto connect_future = client.connect_async();
    // ... do other initialization work here ...
    connect_future.wait();

    auto results = client.query("SELECT current_timestamp()");
    return 0;
}
Connection Pooling (High Performance)
#include <databricks/client.h>

int main() {
    databricks::PoolingConfig pooling;
    pooling.enabled = true;  // pooling is off by default

    auto client = databricks::Client::Builder()
        .with_environment_config()
        .with_pooling(pooling)
        .build();

    auto results = client.query("SELECT * FROM my_table");
    return 0;
}
PoolingConfig (optional performance optimization; attach with the Builder's with_pooling()):
- enabled - Enable connection pooling (default: false)
- min_connections - Minimum connections to maintain in pool (default: 1)
- max_connections - Maximum connections allowed in pool (default: 10)
Note: Multiple Clients with the same config automatically share the same pool!
Automatic Retry Logic (Reliability)
The SDK includes automatic retry logic with exponential backoff for transient failures:
#include <databricks/client.h>

int main() {
    databricks::RetryConfig retry;
    retry.max_attempts = 5;  // e.g., allow more attempts than the default of 3

    auto client = databricks::Client::Builder()
        .with_environment_config()
        .with_retry(retry)
        .build();

    auto results = client.query("SELECT * FROM my_table");
    return 0;
}
RetryConfig (automatic error recovery; attach with the Builder's with_retry()):
- enabled - Enable automatic retries (default: true)
- max_attempts - Maximum retry attempts (default: 3)
- initial_backoff_ms - Initial backoff in milliseconds (default: 100 ms)
- backoff_multiplier - Exponential backoff multiplier (default: 2x)
- max_backoff_ms - Maximum backoff cap (default: 10 s)
- retry_on_timeout - Retry on connection timeout (default: true)
- retry_on_connection_lost - Retry on connection errors (default: true)
Retry Features:
- Exponential backoff with jitter to prevent thundering herd
- Intelligent error classification - only retries transient errors:
- Connection timeouts and network errors
- Server unavailability (503, 502, 504)
- Rate limiting (429 Too Many Requests)
- Non-retryable errors fail immediately:
- Authentication failures
- SQL syntax errors
- Permission denied errors
- Enabled by default with sensible defaults
- Works with connection pooling for maximum reliability
Disable Retries (if needed):
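A minimal sketch: set RetryConfig::enabled to false and pass it to the Builder:
databricks::RetryConfig retry;
retry.enabled = false;  // turn off automatic retries

auto client = databricks::Client::Builder()
    .with_environment_config()
    .with_retry(retry)
    .build();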
Mixing Configuration Approaches
The Builder pattern allows you to mix automatic and explicit configuration:
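For example, a sketch combining automatic environment configuration with an explicit pooling override:
databricks::PoolingConfig pooling;
pooling.enabled = true;

auto client = databricks::Client::Builder()
    .with_environment_config()  // auth and SQL settings from profile or env vars
    .with_pooling(pooling)      // explicit pooling configuration
    .build();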
Or load auth separately from SQL settings:
auto auth = databricks::AuthConfig::from_profile("DEFAULT");  // auth from the Databricks CLI profile

databricks::SQLConfig sql;
if (const char* path = std::getenv("CUSTOM_HTTP_PATH"))
    sql.http_path = path;

auto client = databricks::Client::Builder()
    .with_auth(auth)
    .with_sql(sql)
    .build();
Accessing Configuration
You can access the modular configuration from any client:
const auto& auth = client.get_auth_config();
const auto& sql = client.get_sql_config();
const auto& pooling = client.get_pooling_config();
std::cout << "Connected to: " << auth.host << std::endl;
std::cout << "Using warehouse: " << sql.http_path << std::endl;
For a complete example, see examples/simple_query.cpp.
Running Examples
Setup Configuration
Examples automatically load configuration from either:
Option A: Profile File (recommended for development)
Create ~/.databrickscfg:
[DEFAULT]
host = https://your-workspace.databricks.com
token = your_databricks_token
http_path = /sql/1.0/warehouses/your_warehouse_id
# or: sql_http_path = /sql/1.0/warehouses/your_warehouse_id
Option B: Environment Variables (recommended for CI/CD)
export DATABRICKS_HOST="https://your-workspace.databricks.com"
export DATABRICKS_TOKEN="your_databricks_token"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/your_warehouse_id"
Or source a .env file:
set -a; source .env; set +a
Note: Profile configuration takes priority. Environment variables are used only as a fallback when no profile is configured.
Run Examples
After building with BUILD_EXAMPLES=ON, the following examples are available:
# SQL query execution with parameterized queries
./build/examples/simple_query
# Jobs API - list jobs, get details, trigger runs
./build/examples/jobs_example
# Compute API - manage clusters, create/start/stop/terminate
./build/examples/compute_example
Each example demonstrates a different aspect of the SDK:
- simple_query: Basic SQL execution and parameterized queries
- jobs_example: Jobs API for workflow automation
- compute_example: Compute/Clusters API for cluster management
Performance Considerations
Connection Pooling Benefits
Connection pooling eliminates the overhead of creating new ODBC connections for each query:
- Without pooling: 500-2000ms per query (includes connection time)
- With pooling: 1-50ms per query (connection reused)
- Recommended: Use pooling for applications making multiple queries
Async Operations Benefits
Async operations reduce perceived latency by performing work in the background:
- Async connect: Start connecting while doing other initialization
- Async query: Execute multiple queries concurrently
- Combined with pooling: Maximum throughput for concurrent workloads
Best Practices
- Enable pooling via PoolingConfig for applications making multiple queries
- Use async operations when you can do other work while waiting
- Enable retry logic (on by default) for production reliability against transient failures
- Combine pooling + retries for maximum reliability and performance
- Size pools appropriately: min = typical concurrent load, max = peak load
- Share configs: Clients with identical configs automatically share pools
- Tune retry settings based on your workload (see the sketch below):
- High-throughput: lower max_attempts (2-3) to fail fast
- Critical operations: higher max_attempts (5-7) for maximum reliability
- Rate-limited APIs: increase initial_backoff_ms and max_backoff_ms
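For example, a sketch of retry settings tuned for a rate-limited workload (the values are illustrative, not SDK recommendations):
databricks::RetryConfig retry;
retry.max_attempts = 5;          // tolerate several 429 responses
retry.initial_backoff_ms = 500;  // back off more conservatively from the start
retry.max_backoff_ms = 30000;    // allow longer waits between attempts

auto client = databricks::Client::Builder()
    .with_environment_config()
    .with_retry(retry)
    .build();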
Advanced Usage
Jobs API
Interact with Databricks Jobs to automate and orchestrate data workflows:
#include <databricks/jobs.h>
#include <databricks/config.h>
#include <iostream>
#include <map>

int main() {
    // Load authentication configuration from all available sources (profile + env vars)
    auto auth = databricks::AuthConfig::from_environment();

    // Jobs client for interacting with the Databricks Jobs API
    databricks::Jobs jobs(auth);

    // List jobs (limit 25, offset 0)
    auto job_list = jobs.list_jobs(25, 0);
    for (const auto& job : job_list) {
        std::cout << "Job: " << job.name
                  << " (ID: " << job.job_id << ")" << std::endl;
    }

    // Get details for a specific job
    auto job = jobs.get_job(123456789);
    std::cout << "Created by: " << job.creator_user_name << std::endl;

    // Trigger a run with notebook parameters
    std::map<std::string, std::string> params;
    params["date"] = "2024-01-01";
    params["environment"] = "production";
    uint64_t run_id = jobs.run_now(123456789, params);
    std::cout << "Started run: " << run_id << std::endl;

    return 0;
}
Key Features:
- List jobs: Paginated listing with limit/offset support
- Get job details: Retrieve full job configuration and metadata
- Trigger runs: Start jobs with optional notebook parameters
- Type-safe IDs: Uses uint64_t to correctly handle large job IDs
- JSON parsing: Built on nlohmann/json for reliable parsing
API Compatibility:
- Uses Jobs API 2.2 for full feature support including pagination
- Timestamps returned as Unix milliseconds (uint64_t)
- Automatic error handling with descriptive messages
For a complete example, see examples/jobs_example.cpp.
Compute/Clusters API
Manage Databricks compute clusters programmatically:
int main() {
    // Construct the Compute/Clusters API client (auth loading mirrors the Jobs example above)
    auto auth = databricks::AuthConfig::from_environment();
    databricks::Compute compute(auth);

    // List clusters with their current state
    auto clusters = compute.list_compute();
    for (const auto& c : clusters) {
        std::cout << c.cluster_name << " [" << c.state << "]" << std::endl;
    }

    // Cluster lifecycle operations
    compute.start_compute("cluster-id");
    compute.restart_compute("cluster-id");
    compute.terminate_compute("cluster-id");
    return 0;
}
Features:
- List/get cluster details
- Start, restart, and terminate clusters
- Cluster state tracking (PENDING, RUNNING, TERMINATED, etc.)
- Automatic HTTP retry logic with exponential backoff
HTTP Retry Logic:
All REST API calls automatically retry on transient failures (408, 429, 500-504) with exponential backoff (1s, 2s, 4s). This is built into the HTTP client and requires no configuration.
Direct ConnectionPool Management
For advanced users who need fine-grained control over connection pools:
databricks::AuthConfig auth;
auth.host = "https://my-workspace.databricks.com";
auth.token = "dapi1234567890abcdef";

databricks::SQLConfig sql;
sql.http_path = "/sql/1.0/warehouses/abc123";

// Construct the thread-safe pool from the configs (constructor form shown here is illustrative)
// and pre-create the minimum number of connections
databricks::ConnectionPool pool(auth, sql);
pool.warm_up();

{
    // Acquire a connection; it is returned to the pool when it goes out of scope
    auto pooled_conn = pool.acquire();
    auto results = pooled_conn->query("SELECT...");
}

// Inspect pool statistics
auto stats = pool.get_stats();
std::cout << "Available: " << stats.available_connections << std::endl;
Note: Most users should use the Builder with PoolingConfig instead of direct pool management.
Documentation
The SDK includes comprehensive API documentation generated from code comments using Doxygen.
📚 View Online Documentation
Live Documentation: https://calvinjmin.github.io/databricks-sdk-cpp/
The documentation is automatically built and published via GitHub Actions whenever changes are pushed to the main branch.
Generate Documentation Locally
# Install Doxygen
brew install doxygen # macOS
# or: sudo apt-get install doxygen # Linux
# Generate docs (creates docs/html/)
doxygen Doxyfile
# View in browser
open docs/html/index.html # macOS
# or: xdg-open docs/html/index.html # Linux
Documentation Features
The generated documentation includes:
- Complete API Reference: All public classes, methods, and structs with detailed descriptions
- README Integration: Full README displayed as the main landing page
- Code Examples: Inline examples from header comments
- Jobs API Documentation: Full reference for databricks::Jobs, Job, and JobRun types
- SQL Client Documentation: Complete databricks::Client API reference
- Connection Pooling: databricks::ConnectionPool and configuration types
- Source Browser: Browse source code with syntax highlighting
- Search Functionality: Quick search across all documentation
- Cross-references: Navigate between related classes and methods
Quick Links (After Generation)
- Main Page: docs/html/index.html - README and getting started
- Classes: docs/html/annotated.html - All classes and structs
- Jobs API: docs/html/classdatabricks_1_1_jobs.html - Jobs API reference
- Client API: docs/html/classdatabricks_1_1_client.html - SQL client reference
- Files: docs/html/files.html - Browse by file
Example: Viewing Jobs API Docs
# Generate and open Jobs API documentation
doxygen Doxyfile
open docs/html/classdatabricks_1_1_jobs.html
The documentation is automatically generated from the inline comments in header files, ensuring it stays synchronized with the code.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Support
For issues and questions, please open an issue on the GitHub repository.