Databricks on Azure is essential in data, AI and IoT solutions, but automating its environments can be challenging. Azure DevOps is a great tool for automation, and using Pipelines together with product CLI integrations can minimise or even remove these challenges. My team is currently working on a cutting-edge IoT platform where data flows from edge devices to Azure. The data is sensitive and subject to GDPR, so no one should have direct access to the data platform in the production environments.

In the project, data is generated by sensors and sent to the cloud by the edge devices. Ingesting, processing and analysing this data is too complicated for traditional relational databases, so we use other tools to refine it. We use Databricks in our Lambda Architecture to batch-process the data at rest and to run predictive analytics and machine learning. This blog post is about Databricks cluster and environment management, and I will not go deeper into the architecture or the IoT solution.

The Automation Problems

Like any reliable project, we have three environments: development, user acceptance testing (UAT) and production. In my two previous posts, Azure Infrastructure using Terraform and Pipelines and Implement Azure Infrastructure using Terraform and Pipelines, I reviewed in depth why and how Terraform solves environment generation and management problems. Let's study the code Terraform provides for Databricks.

resource "azurerm_resource_group" "example" {
  name     = "example-resources"
  location = "West US"
}

resource "azurerm_databricks_workspace" "example" {
  name                = "databricks-test"
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location
  sku                 = "standard"

  tags = {
    Environment = "Production"
  }
}

Wait a minute, but that is only the empty environment!
What about the Clusters, Pools, Libraries, Secrets and Workspaces?

The Solution, DataBricks Automation with Azure DevOps

Fortunately, Databricks has a CLI which we can use for Databricks environment automation with Azure DevOps Pipelines. Pipelines enable us to run PowerShell or Bash scripts as a job step. By calling the CLI in our Bash script, we can create, manage and maintain our Databricks environments. This approach removes the need to do any manual work on the production Databricks workspace. Let's review the Bash script.

#!/bin/bash
set -e

CLUSTER_JSON_FILE="./cluster.json"
INSTANCE_POOL_JSON_FILE="./instance-pool.json"
WAIT_TIME=10

wait_for_cluster_running_state () {
  while true; do
    CLUSTER_STATUS=$(databricks clusters get --cluster-id $CLUSTER_ID | jq -r '.state')
    if [[ $CLUSTER_STATUS == "RUNNING" ]]; then
        echo "Operation ready."
        break
    fi
    echo "Cluster is still in pending state, waiting $WAIT_TIME sec.."
    sleep $WAIT_TIME
  done
}

wait_for_pool_running_state () {
  while true; do
    POOL_STATUS=$(databricks instance-pools get --instance-pool-id $POOL_INSTANCE_ID | jq -r '.state')
    if [[ $POOL_STATUS == "ACTIVE" ]]; then
        echo "Operation ready."
        break
    fi
    echo "Pool instance is not ready yet, waiting $WAIT_TIME sec.."
    sleep $WAIT_TIME
  done
}

arr=( $(databricks clusters list --output JSON | jq -r '.clusters[].cluster_name'))
echo "Current clusters:"
echo "${arr[@]}"

CLUSTER_NAME=$(cat $CLUSTER_JSON_FILE | jq -r  '.cluster_name')

# Cluster already exists
if [[ " ${arr[@]} " =~ $CLUSTER_NAME ]]; then
    echo 'The cluster is already created, skipping the cluster operation.'
    exit 0
fi

# Cluster does not exist
if [[ -z "$arr" || ! " ${arr[@]} " =~ $CLUSTER_NAME ]]; then
  printf "Setting up the databricks environment. Cluster name: %s\n" $CLUSTER_NAME

  #Fetching pool-instances
  POOL_INSTANCES=( $(databricks instance-pools list --output JSON | jq -r 'select(.instance_pools != null) | .instance_pools[].instance_pool_name'))
  POOL_NAME=$(cat $INSTANCE_POOL_JSON_FILE | jq -r  '.instance_pool_name')
  if [[ -z "$POOL_INSTANCES" || ! " ${POOL_INSTANCES[@]} " =~ $POOL_NAME ]]; then
    # Creating the pool-instance
    printf 'Creating new Instance-Pool: %s\n' $POOL_NAME
    POOL_INSTANCE_ID=$(databricks instance-pools create --json-file $INSTANCE_POOL_JSON_FILE | jq -r '.instance_pool_id')
    wait_for_pool_running_state
  fi

  if [[ " ${POOL_INSTANCES[@]} " =~ $POOL_NAME ]]; then
    POOL_INSTANCE_ID=$(databricks instance-pools list --output JSON | jq -r --arg I "$POOL_NAME" '.instance_pools[] | select(.instance_pool_name == $I) | .instance_pool_id')
    printf 'The Pool already exists with id: %s\n' $POOL_INSTANCE_ID
  fi

  # Transforming the cluster JSON
  NEW_CLUSTER_CONFIG=$(cat $CLUSTER_JSON_FILE | jq -r --arg var $POOL_INSTANCE_ID '.instance_pool_id = $var')

  # Creating databricks cluster with the cluster.json values
  printf 'Creating cluster: %s\n' $CLUSTER_NAME

  CLUSTER_ID=$(databricks clusters create --json "$NEW_CLUSTER_CONFIG" | jq -r '.cluster_id')
  wait_for_cluster_running_state

  # Adding cosmosdb Library to the cluster
  printf 'Adding the cosmosdb library to the cluster %s\n' $CLUSTER_ID
  databricks libraries install \
    --cluster-id $CLUSTER_ID \
    --maven-coordinates "com.microsoft.azure:azure-cosmosdb-spark_2.4.0_2.11:1.3.5"
  wait_for_cluster_running_state
  echo 'CosmosDB-Spark -library added successfully.'
  
  echo "Databricks setup created successfully."
fi
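
One caveat about the existence checks in the script: the `=~` operator performs a regular-expression match, so a cluster named main-cluster would also be reported as existing when only main-cluster-old is present. The following is a minimal sketch of an exact-match alternative. The helper name `name_in_list` and the sample cluster names are made up for illustration and are not part of the original script.

```bash
#!/bin/bash
# Sketch: exact-match membership test for names returned by the CLI.
# name_in_list is a hypothetical helper, not part of the original script.
name_in_list () {
  local needle="$1"; shift
  local item
  for item in "$@"; do
    # Quoted right-hand side forces a literal, whole-string comparison.
    if [[ "$item" == "$needle" ]]; then
      return 0
    fi
  done
  return 1
}

# Sample data: a substring check would wrongly treat main-cluster as existing.
clusters=("main-cluster-old" "analytics")
if name_in_list "main-cluster" "${clusters[@]}"; then
  echo "exists"
else
  echo "missing"
fi
```

In the setup script, the same helper could be called with `"${arr[@]}"` and `"${POOL_INSTANCES[@]}"` in place of the `=~` tests.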

First, we create a Pool for the cluster, wait for it to become active and capture its Id. Then we create a cluster that uses the Pool and wait for it to reach the running state. Once the cluster is ready, we can use the cluster id to add Libraries and Workspaces, as the script above does for the Cosmos DB library. There are two supporting JSON files which include the environment properties.

 cluster.json 
{
    "cluster_name": "main-cluster",
    "spark_version": "6.4.x-scala2.11",
    "autoscale": {
        "min_workers": 1,
        "max_workers": 4
    },
    "instance_pool_id": "FROM_EXTERNAL_SOURCE"
}

instance-pool.json
{
    "instance_pool_name": "main-pool",
    "node_type_id": "Standard_D3_v2",
    "min_idle_instances": 2,
    "idle_instances": 2,
    "idle_instance_autotermination_minutes": 60
}
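
The wait functions in the setup script loop forever if a cluster or pool never reaches the expected state, which would leave a pipeline job hanging until the agent times out. Below is a hedged sketch of a bounded variant. The function `wait_for_state`, its arguments and the `fake_status` stand-in are assumptions made for illustration. In the real script, the status command would wrap the `databricks ... get | jq -r '.state'` calls shown earlier.

```bash
#!/bin/bash
# Sketch: poll a status command until it reports the wanted state,
# giving up after a maximum number of attempts.
# wait_for_state and its arguments are hypothetical, not from the original script.
wait_for_state () {
  local wanted="$1" max_attempts="$2" delay="$3"
  shift 3
  local attempt status
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    # "$@" is the command that prints the current state on stdout.
    status=$("$@") || status="UNKNOWN"
    if [[ "$status" == "$wanted" ]]; then
      echo "Reached state $wanted after $attempt attempt(s)."
      return 0
    fi
    echo "State is $status, waiting $delay sec.."
    sleep "$delay"
  done
  echo "Timed out waiting for state $wanted." >&2
  return 1
}

# Stand-in for the real status query so the sketch runs without a workspace.
fake_status () { echo "RUNNING"; }

wait_for_state "RUNNING" 3 0 fake_status
```

Because the function returns a non-zero status on timeout, a script running under `set -e` stops at that point instead of continuing with a cluster that is not ready.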

In our Azure DevOps Pipelines definition, we first have to install the Python runtime and then the Databricks CLI. With the required runtimes in place, we can run the Bash script. Here is the code snippet for the Pipelines steps:

- bash: |
    python -m pip install --upgrade pip setuptools wheel
    python -m pip install databricks-cli

    databricks --version
  displayName: Install Databricks CLI

- bash: |
    cat >~/.databrickscfg <<EOL
    [DEFAULT]
    host = https://westeurope.azuredatabricks.net
    token = $(DATABRICKS_TOKEN)
    EOL
  displayName: Configure Databricks CLI

- task: ShellScript@2
  inputs:
    workingDirectory: $(Build.SourcesDirectory)/assets/databricks
    scriptPath: $(Build.SourcesDirectory)/assets/databricks/setup.sh
    args: ${{ parameters.project }}-${{ parameters.workspace }}
  displayName: Setup Databricks

To be able to run the script against the Databricks environment you need a token. The token can be generated in the workspace under User Settings.

DataBricks Automation with Azure DevOps Pipelines. DataBricks Token.
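
If the token variable is empty or missing, the configuration step still writes a file, and the failure only shows up later at the first CLI call. A minimal sketch of a guard follows. The helper `write_databricks_cfg` and the placeholder token are hypothetical, and the file layout simply mirrors the configuration step in the pipeline.

```bash
#!/bin/bash
# Sketch: write a Databricks CLI profile, refusing to continue without a token.
# write_databricks_cfg is a hypothetical helper; the file layout mirrors the
# pipeline step that creates ~/.databrickscfg.
write_databricks_cfg () {
  local cfg_file="$1" host="$2" token="$3"
  if [[ -z "$token" ]]; then
    echo "No Databricks token provided." >&2
    return 1
  fi
  # umask 077 so that a newly created file is readable only by its owner.
  ( umask 077
    printf '[DEFAULT]\nhost = %s\ntoken = %s\n' "$host" "$token" > "$cfg_file" )
}

# Example with a placeholder token written to a temporary file.
demo_cfg=$(mktemp)
write_databricks_cfg "$demo_cfg" "https://westeurope.azuredatabricks.net" "dapi-placeholder"
cat "$demo_cfg"
rm -f "$demo_cfg"
```

In a pipeline, the first argument would be `~/.databrickscfg` and the token would come from a secret pipeline variable rather than a literal.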

The environment variables and settings are in JSON files, and the complete solution for DataBricks Automation with Azure DevOps Pipelines and support tool files are available from my GitHub repository.

The process for creating Azure Functions is straightforward on the Azure Portal. The only confusing option during function creation is which hosting plan to choose. There are four different hosting plans to choose from, and you also decide which OS hosts your functions. In this blog post, I'll review the different choices and what suits you best. This is what you will see on the Azure Portal when choosing your hosting plan:

Azure Functions hosting plans for each OS.

Consumption plan

The Consumption plan is available on both Windows and Linux (Linux is currently in public preview). If you are new to Azure Functions or need a function up and running quickly, I would recommend picking this plan, as it will make your life easier and you can get to the coding part rapidly. With this option, the platform dynamically allocates enough compute power, or in other words hosts, to run your code and scales up or down automatically as needed. You pay only for use and not for idle time. The bill is aggregated from all functions within an app and is based on the number of executions, execution time and memory used.

App Service Plan

The App Service Plan is the second choice, available on both Windows and Linux. This plan dedicates a virtual machine to run your functions. If you have long-running, continuous, CPU- and memory-intensive algorithms, this is the option to choose for the most cost-effective hosting of the function. This plan lets you choose from the Basic, Standard, Premium and Isolated SKUs and also connect to your on-premises networks over VNET/VPN to communicate with your site data. In this plan, the function will not cost more than the VM instance that is allocated. More about Azure App Service Plans can be found in Microsoft's official documentation.

An excellent example for choosing the App Service Plan is when the system needs to continuously crawl for certain data, either from on-premises or from the internet, and save the information to Azure Blob Storage for further processing.

Containers
Azure Functions also supports custom Linux images and containers. I'll dedicate a blog post to that option shortly.

Timeouts

By default, the function app timeout on the Consumption plan is five minutes and can be increased to ten minutes in both runtime versions one and two. On the App Service plan, version one has an unlimited timeout by default, while version two defaults to 30 minutes, which can be raised to unlimited if needed.
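
The timeout is controlled by the `functionTimeout` setting in the function app's host.json file, written as hh:mm:ss. As a rough sketch, the script below generates such a file for the version two runtime. The helper `write_host_json` is hypothetical, and the ten-minute value is only the Consumption plan maximum mentioned above.

```bash
#!/bin/bash
# Sketch: generate a host.json that sets the function timeout.
# write_host_json is a hypothetical helper; functionTimeout uses hh:mm:ss.
write_host_json () {
  local target_dir="$1" timeout="$2"
  # Reject values that are not in the two-digit hh:mm:ss form.
  if [[ ! "$timeout" =~ ^[0-9]{2}:[0-5][0-9]:[0-5][0-9]$ ]]; then
    echo "Timeout must look like 00:10:00." >&2
    return 1
  fi
  cat > "$target_dir/host.json" <<EOF
{
  "version": "2.0",
  "functionTimeout": "$timeout"
}
EOF
}

# Example: raise the timeout to ten minutes in a temporary directory.
demo_dir=$(mktemp -d)
write_host_json "$demo_dir" "00:10:00"
cat "$demo_dir/host.json"
rm -rf "$demo_dir"
```

The `"version": "2.0"` entry applies to the version two runtime. A version one host.json would omit it.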

After creating the function with a particular hosting plan, you cannot change the plan afterwards; the only way is to recreate the Function App. The current hosting plan is shown on the Azure Portal under the Overview tab when you click on the function name. More information about pricing can be found on the Azure Functions pricing page.

One of the most popular Azure features is Azure App Services and the Platform as a Service (PaaS) architecture approach. It removes the overhead of setting up additional infrastructure, speeds up getting apps up and running, and is an economical solution for hosting user-facing web apps or API solutions for web or mobile apps. For the last few years, App Services has played a significant role in the architecture and services I design for customers.

As the need for background processes increased, Microsoft introduced Azure WebJobs as a part of Azure Web Apps, and it was the first step towards the functional serverless architecture. A WebJob is a workflow step with a trigger based on time or on, e.g., Azure Storage events, which undertakes a specific logical task. WebJobs are a powerful tool to process data and create further actions based on business rules. The downside is that, compared to Azure Functions, modifying and monitoring them from the UI is poor and logging is laborious.

The release of Azure Functions was a game changer in architectural plans and in the way background processes are handled in the Microsoft cloud. Azure Functions are hosted on top of the Azure Web Apps architecture and can be triggered by HTTP requests, time schedules, or events in Azure Storage, Service Bus or Azure Event Hub. The full introduction to Functions is available in Microsoft's documentation.

Function Apps can be created using the developer's preferred programming language, like C# or JavaScript, either from the Azure Portal using the web editor or using Visual Studio. Cross-platform developers can use Visual Studio Code for development in their non-Windows environments.

Azure Functions have two runtime versions, and there are significant differences between versions one and two. Version 2.X runs in a sandbox, which limits access to some specific libraries in C# and .NET Core. As an example, if your function manipulates images or videos, you don't have access to the framework GUI libraries, and you will face exceptions. Version 1.X uses the .NET Framework 4.7 and is a powerful alternative runtime for processes where full access to .NET Framework libraries is needed. The full list of supported languages and runtimes is available in Microsoft's documentation.

Here is an example of the usage of Functions:
A client has financial data in different file formats and needs to process the information. The client receives most of the data in text-based PDF format. Using Functions is a perfect way to process textual content from PDF files to create data for search and artificial intelligence. The following drawing illustrates the architecture.

  1. Azure blob storage to host files and PDF documents
  2. Azure Function which will be triggered as a new file is added to a container
  3. Azure Cosmos DB to save the content of the PDF file in JSON format
  4. Azure Cognitive Services to process textual content