Delta Sharing: an open sharing protocol

Delta Sharing is an open standard for secure data sharing. Nexalis uses Delta Sharing to give users seamless access to shared datasets across different environments: you can query the data directly from Python, Java, or Spark without complex integration steps, enabling data sharing across organizations.

Credential File

When Nexalis shares a dataset with you, you will receive an email with a one-time download link. Clicking this link takes you to a Databricks portal where you can download a credential file, which is required for connecting your applications to the shared data. Store it safely, as there is no option to re-download it later. The link opens the Databricks Delta Sharing token page; click the blue download box, and a file named “config.share” will start downloading automatically. The file should contain:
{
  "shareCredentialsVersion": 1,
  "bearerToken": "<SHARING_TOKEN>",
  "endpoint": "https://<DELTA_SHARING_ENDPOINT>",
  "expirationTime": "2024-06-18T22:36:37.792Z"
}
This credential file is essential for authentication and contains:
  • shareCredentialsVersion: protocol version used.
  • bearerToken: token that grants you access.
  • endpoint: the Delta Sharing endpoint URL.
  • expirationTime: when the credential becomes invalid.
Save this file securely (for example, in a protected path that is not under version control). It cannot be re-downloaded later: you will reference it whenever you connect to Delta Sharing from your applications, and once it expires you will need to request a new file from Nexalis; there is no way to refresh it yourself.
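
To confirm the credential file works before wiring it into a pipeline, you can list the tables it grants access to with the delta-sharing Python client. This is an optional sketch; it assumes the delta-sharing package is installed (pip install delta-sharing) and that config.share sits in the current directory:

import delta_sharing

# Path to the credential file Nexalis provided (adjust if stored elsewhere).
profile_path = "./config.share"

# The client reads the endpoint and bearer token from the profile file.
client = delta_sharing.SharingClient(profile_path)

# Print every table the credential grants access to.
for table in client.list_all_tables():
    print(f"{table.share}.{table.schema}.{table.name}")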

Consuming the data

Nexalis provides data through the open Delta Sharing protocol, which is not limited to Python: you can access the same shared datasets from other languages and tools, such as R, Scala, Java, and BI platforms. The tutorial below focuses on Python and Spark because they are the most common choices for data analysis and pipelines, but you are free to use other environments. For additional examples, refer to the official Delta Sharing documentation and tutorials.
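
For example, small pulls do not even require Spark: the delta-sharing Python client can load a shared table straight into pandas. A minimal sketch, assuming the delta-sharing package is installed and using the table name format Nexalis provides (described under "What Nexalis Provides" below):

import delta_sharing

# Fully qualified table URL: <profile path>#<user_name>.<client_name>.<table>
table_url = "./config.share#<user_name>.<client_name>.<table>"

# limit (available in recent delta-sharing releases) keeps a quick inspection small.
pdf = delta_sharing.load_as_pandas(table_url, limit=100)
print(pdf.head())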

Python Example

Delta Sharing with Python and Spark — How to Run Locally

Nexalis provides you with a secure credential file (config.share) and the fully qualified table names you are allowed to access. You can read these shared tables on your machine in two ways:
  • Method A — Embedded Python: run from a normal Python process and create a SparkSession in-process. Best for ad-hoc exploration and small pulls.
  • Method B — spark-submit: run a standalone local Spark job with explicit packages and classpath. Best for larger data, repeatable jobs, and especially for real-time/“only new data” streaming using Spark Structured Streaming.

What Nexalis Provides

Nexalis will share with you:
  • The credential profile file: config.share (keep it safe and do not alter it).
  • One or more Delta Sharing table names in the format:
./config.share#<user_name>.<client_name>.<table>
Replace <user_name> with your assigned Nexalis username.
Replace <client_name> with your organization’s client name.
Replace <table> with the shared table name (a short example of assembling this URL follows below).
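
The table URL is simply the credential file path joined to the fully qualified table name with a "#". A small sketch assembling it from hypothetical placeholder values:

# Assemble the Delta Sharing table URL from the pieces Nexalis provides.
# The placeholder values below are illustrative; substitute your own.
profile_path = "./config.share"
user_name = "<user_name>"      # your assigned Nexalis username
client_name = "<client_name>"  # your organization's client name
table = "<table>"              # the shared table name

table_url = f"{profile_path}#{user_name}.{client_name}.{table}"
print(table_url)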

Prerequisites

  • A supported Java runtime (JDK 8/11/17). Make sure JAVA_HOME points to it (a quick environment check is sketched after this list).
  • For Method A: Python 3.8+ and pip.
  • For Method B: Local Spark 3.4.2 (see installation steps below).
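
Optionally, you can verify these prerequisites from Python before starting. This sketch only checks what is visible from your shell; the version requirements themselves are as listed above:

import os
import shutil
import subprocess

# Both methods need a JDK visible to Spark.
print("JAVA_HOME =", os.environ.get("JAVA_HOME", "(not set)"))
print("java on PATH:", shutil.which("java") or "(not found)")
if shutil.which("java"):
    subprocess.run(["java", "-version"], check=False)  # prints to stderr on most JDKs

# Only relevant for Method B (spark-submit).
print("SPARK_HOME =", os.environ.get("SPARK_HOME", "(not set)"))
print("spark-submit on PATH:", shutil.which("spark-submit") or "(not found)")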

Method A — Embedded Python (Spark created in-process)

When to use: quick exploration, notebooks, small to medium pulls, minimal setup. How it works: install Python libraries, keep config.share next to your script, start a SparkSession inside Python, and query the shared table.

Steps

  1. Install the required libraries:
pip install delta-sharing pyspark pandas
  2. Place config.share in a safe location (for example, next to your script).
  3. Build the table URL in your script:
table_url = "./config.share#<user_name>.<client_name>.<table>"
  4. Minimal example (batch read):
from pyspark.sql import SparkSession

# Table URL built in step 3.
table_url = "./config.share#<user_name>.<client_name>.<table>"

spark = (
    SparkSession.builder
    .appName("DeltaSharingLocal")
    # Pull in the Delta Sharing connector so the "deltasharing" source is
    # available in-process (same version used with spark-submit in Method B).
    .config("spark.jars.packages", "io.delta:delta-sharing-spark_2.12:0.6.4")
    .getOrCreate()
)

df = (
    spark.read.format("deltasharing")
    .load(table_url)
    # tsConnector is epoch milliseconds; adjust the window to your needs.
    .where("tsConnector > 1742258615000 AND siteName = 'siteXYZ' AND dataPoint = 'ACTIVE_POWER'")
    .select("siteName", "dataPoint", "value", "unit", "tsConnector")
)

pdf = df.toPandas()
print(pdf.head())
Notes:
  • tsConnector is an epoch timestamp in milliseconds. Adjust filters to your time window (a conversion helper is sketched after these notes).
  • This method produces a batch snapshot. For larger pulls or continuous updates, use Method B.
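
To build the time filter, convert your window boundary to epoch milliseconds. A small helper sketch; the example date and filter values are arbitrary:

from datetime import datetime, timezone

def to_epoch_ms(dt: datetime) -> int:
    # tsConnector values are epoch milliseconds (UTC).
    return int(dt.timestamp() * 1000)

since = to_epoch_ms(datetime(2025, 3, 18, tzinfo=timezone.utc))
condition = f"tsConnector > {since} AND siteName = 'siteXYZ' AND dataPoint = 'ACTIVE_POWER'"
print(condition)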

Method B — spark-submit (standalone Spark)

When to use: heavier data, repeatable jobs with full logs, or when you need real-time “only new data” ingestion. How it works: install Spark locally, then launch your script with spark-submit, including the Delta Sharing connector package.

One-time Spark Setup

wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz
tar xzf spark-3.4.2-bin-hadoop3.tgz
export SPARK_HOME=$PWD/spark-3.4.2-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH

Batch launch template (snapshot reads)

spark-submit \
  --packages io.delta:delta-sharing-spark_2.12:0.6.4 \
  your_script.py \
  --delta_table_path './config.share#<user_name>.<client_name>.<table>'
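
The template above passes --delta_table_path to your_script.py. The following is an illustrative sketch of what such a batch script might look like; the argument name and output path are conventions for this example, not a fixed Nexalis interface:

# your_script.py -- minimal batch reader launched via spark-submit.
import argparse

from pyspark.sql import SparkSession

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--delta_table_path", required=True,
                        help="./config.share#<user_name>.<client_name>.<table>")
    args = parser.parse_args()

    # The Delta Sharing connector is supplied by --packages on spark-submit.
    spark = SparkSession.builder.appName("DeltaSharingBatch").getOrCreate()

    df = spark.read.format("deltasharing").load(args.delta_table_path)

    df.printSchema()
    print("row count:", df.count())

    # Example: persist a snapshot locally for downstream work.
    df.write.mode("overwrite").parquet("/tmp/nexalis_batch_out/parquet")

    spark.stop()

if __name__ == "__main__":
    main()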

Real-Time Refresh with Structured Streaming

For continuous ingestion of only new data, use Spark Structured Streaming. Unlike periodic batch jobs, a streaming job runs continuously, processing micro-batches (e.g., every 30 seconds). Spark automatically tracks progress in a checkpoint so that previously processed rows are not re-read.

Minimal streaming example

from pyspark.sql import SparkSession

table_url = "./config.share#<user_name>.<client_name>.<table>"

spark = (
    SparkSession.builder
    .appName("DeltaSharingStreaming")
    .getOrCreate()
)

# Define the streaming DataFrame
stream_df = (
    spark.readStream
    .format("deltasharing")
    .load(table_url)
    .where("siteName = 'siteXYZ'")
)

# Example sink: console (demo)
query = (
    stream_df.writeStream
    .format("console")
    .outputMode("append")
    .trigger(processingTime="30 seconds")
    .option("truncate", "false")
    .option("checkpointLocation", "/tmp/nexalis_stream_out/_chkpt")
    .start()
)

query.awaitTermination()

Writing Streaming Data to Outputs

Every Structured Streaming job must define a sink (output destination). Common options:

Console (testing/debugging)

Displays records in the terminal.
query = (
    stream_df.writeStream
    .format("console")
    .outputMode("append")
    .start()
)

Parquet / Delta Files (persistent storage)

Stores results for later queries or integration into pipelines.
query = (
    stream_df.writeStream
    .format("parquet")
    .option("path", "/tmp/nexalis_stream_out/parquet")
    .option("checkpointLocation", "/tmp/nexalis_stream_out/_chkpt")
    .start()
)
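
Once the stream has produced a few micro-batches, the Parquet output can be queried like any other local dataset. A small sketch using the same output path as above; tsConnector is assumed to be present, as in the batch example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InspectStreamOutput").getOrCreate()

# Read back whatever the streaming query has written so far.
out = spark.read.parquet("/tmp/nexalis_stream_out/parquet")
out.orderBy("tsConnector", ascending=False).show(20, truncate=False)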

Database / API (integration with dashboards or apps)

Pushes each micro-batch into an external system.
# Requires the appropriate JDBC driver on the Spark classpath
# (for example, added via --packages or --jars on spark-submit).
def save_to_db(batch_df, batch_id):
    batch_df.write \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://dbserver/mydb") \
        .option("dbtable", "streaming_results") \
        .option("user", "dbuser") \
        .option("password", "dbpass") \
        .mode("append") \
        .save()

query = (
    stream_df.writeStream
    .foreachBatch(save_to_db)
    .start()
)
⚠️ Always configure a checkpointLocation to ensure Spark tracks what data has already been processed.

Launching with spark-submit

spark-submit \
  --packages io.delta:delta-sharing-spark_2.12:0.6.4 \
  streaming_reader.py \
  --delta_table_path './config.share#<user_name>.<client_name>.<table>'

Why Streaming is Different from Periodic Batch

  • Periodic batch (cron + spark-submit): each run starts fresh. Without custom logic (like tracking the last timestamp), it may re-read old data; a minimal watermark sketch follows after this list.
  • Structured Streaming (spark.readStream): one long-lived Spark job. It tracks progress in checkpoints and automatically processes only new data each trigger.
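
If you do use the periodic-batch pattern, the "custom logic" usually amounts to persisting the largest tsConnector seen so far and filtering on it in the next run. A minimal sketch of that idea using a local watermark file; the file path, output location, and column names follow the examples above and are arbitrary choices for this example:

import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

WATERMARK_FILE = "/tmp/nexalis_incremental_out/last_ts.txt"
table_url = "./config.share#<user_name>.<client_name>.<table>"

def read_watermark() -> int:
    # Largest tsConnector processed by the previous run (0 on the first run).
    if os.path.exists(WATERMARK_FILE):
        with open(WATERMARK_FILE) as f:
            return int(f.read().strip())
    return 0

def write_watermark(ts: int) -> None:
    os.makedirs(os.path.dirname(WATERMARK_FILE), exist_ok=True)
    with open(WATERMARK_FILE, "w") as f:
        f.write(str(ts))

spark = SparkSession.builder.appName("DeltaSharingIncrementalBatch").getOrCreate()

last_ts = read_watermark()

# Read only rows newer than the last processed timestamp.
df = (
    spark.read.format("deltasharing")
    .load(table_url)
    .where(F.col("tsConnector") > last_ts)
)

new_max = df.agg(F.max("tsConnector")).first()[0]
if new_max is not None:
    df.write.mode("append").parquet("/tmp/nexalis_incremental_out/parquet")
    write_watermark(new_max)

spark.stop()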

Choosing a Method

Method A — Embedded Python
  • Best for: quick exploration, notebooks, small to medium data pulls.
  • Limitations: not optimized for large-scale jobs; limited logging and monitoring.
  • Example use cases: interactive analysis, prototyping in Jupyter notebooks, testing queries.

Method B — spark-submit (Batch)
  • Best for: large repeatable snapshots, ETL jobs, controlled pipelines.
  • Limitations: requires Spark installation and setup; each run starts fresh (may re-read old data if not handled).
  • Example use cases: scheduled data extractions, periodic reporting, building data pipelines.

Method B — spark-submit (Streaming)
  • Best for: near real-time ingestion and continuous updates.
  • Limitations: more complex setup; requires checkpointing and monitoring; long-running job.
  • Example use cases: real-time dashboards, alerting systems, streaming ETL into databases.

Checklist to Avoid Pitfalls

  1. Java/Spark prerequisites: ensure JAVA_HOME points to a compatible JDK (8/11/17).
  2. Version alignment: keep Spark at 3.4.2 and the Delta Sharing connector at io.delta:delta-sharing-spark_2.12:0.6.4.
  3. Path to config.share: use a correct relative or absolute path.
  4. Time filters: tsConnector is in epoch milliseconds. Convert your time windows.
  5. Local parallelism: add --master local[*] if you want Spark to use all cores.
  6. Streaming durability: always configure a checkpointLocation.
  7. Table access: make sure <user_name>.<client_name>.<table> matches exactly what Nexalis shared with you.