With services like HDInsight and Azure Databricks, Apache Spark is quickly becoming a major player on the Azure data platform. This post looks at how Apache Spark can be integrated with the Azure SQL services. It's important to note that while I refer to Azure Databricks and Azure SQL Data Warehouse throughout the post, the same use cases and code examples can be used when working with Apache Spark on HDInsight and with Azure SQL DB.
First, there are several good use cases for using Azure Databricks with Azure SQL Data Warehouse (DW). A few of them are listed below.
Batch or Streaming ETL
A common use case is to use Azure Databricks to perform ETL processing and data engineering on big data and/or streaming datasets before inserting the data into Azure SQL DW. Source data from Azure Data Lake Store, Blob storage, or streaming in from services like Kafka on HDInsight or Eventhubs can be transformed and/or aggregated using Spark SQL functions before being inserted into Azure SQL DW.
* Important note: In this post, I'll show how you can insert data directly into Azure SQL DW from Spark, but that doesn't necessarily mean that it is always the "best" way to do it. There are other options such as writing your DataFrame out to a directory on Blob or Data Lake Store and using Polybase to load it into Azure SQL DW. As always, be mindful of your use case and standards being used.
Reference Data and Mashup
Another use case for Azure Databricks with Azure SQL DW is to mashup the data from Azure SQL DW with data from another source. This can be done by reading data from Azure SQL DW into a Spark DataFrame and joining it to another DataFrame from another source. This can also be done by creating a Spark SQL table or view from the Azure SQL DW data and then joining that table or view to an existing Spark SQL or Hive table.
Azure Databricks can also be used with Azure SQL DW for advanced analytics and machine learning scenarios. Machine learning models can be trained using big data sets residing in Data Lake Store or Blob storage (or even Azure SQL DW). When predictions are made on new data, those predictions can be stored and analyzed in Azure SQL DW.
Interacting with Azure SQL DW Using Apache Spark
Apache Spark can connect to Azure SQL DW using the Microsoft SQL Server JDBC driver. Both Azure Databricks and HDInsight Spark clusters come with the JDBC driver already installed. Below are some examples of how the JDBC driver can be used to work with Azure SQL DW.
Import a SQL Table into a Spark DataFrame
In the simple example below, all of data from a SQL table (in this case the SalesLT.Customer table from the AdventureWorksLT database) is loaded into a Spark DataFrame. The top portion of the code sets the parameters for the JDBC connection including the Azure SQL Server name, the credentials, the database, and the table to be loaded into the DataFrame. The rest of the code defines the DataFrame to be loaded using the DataFrameReader.jdbc method in the pyspark.sql module along with the parameters that were just defined.
## Connection parameters ## sqlserver = '#SERVERNAME#.database.windows.net' port = '1433' database = 'AdventureWorksLT' user = '#USERNAME#' pswd = "#PASSWORD" table = 'SalesLT.Customer' ## Load Data Frame ## df1 = spark.read \ .option('user', user) \ .option('password', pswd) \ .jdbc('jdbc:sqlserver://' + sqlserver + ':' + port + ';database=' + database, table)
Here are the results of a show() call on the DataFrame that was just created:
Import a SQL Query into a Spark DataFrame
The code for importing data from a SQL query into a DataFrame is almost identical to the code above except that the table name is replaced by a SQL query in the last parameter. Using a query provides more control of the data being loaded into the DataFrame, as the query is "pushed down" (push-down query) to the SQL server. This allows you to select only the fields you want, to filter out the data you don't need, and to return data from multiple tables using a join. In the example below, I changed the name of the last parameter to 'query' (only because it is not a table... it doesn't matter what you call it). Notice the syntax for the query parameter. It's enclosed by a set of parentheses and is named at the end, just as you would a sub-query in SQL Server.
## Connection parameters ## sqlserver = '#SERVERNAME#.database.windows.net' port = '1433' database = 'AdventureWorksLT' user = '#USERNAME#' pswd = "#PASSWORD#" query = '(select top 100 * from SalesLT.Customer) Cust' ## Load DataFrame ## df1 = spark.read \ .option('user', user) \ .option('password', pswd) \ .jdbc('jdbc:sqlserver://' + sqlserver + ':' + port + ';database=' + database, query)
The schema of the resulting DataFrame is below. You'll note that the field names and data types were all inferred from the data source.
Insert Spark DataFrame Data into a SQL Table
Just as easily as data from SQL Data Warehouse queries and tables can be loaded into Spark DataFrames, the data in Spark DataFrames can be inserted into SQL Data Warehouse tables. In the following simple example, we have a 3 column table in Azure SQL DW called SalesLT.SalesRep. The PySpark code below creates a simple DataFrame and then inserts the contents of the DataFrame into the referenced SalesLT.SalesRep table in Azure SQL DW. It then reads the table into a new DataFrame and shows the results.
# Create table on SQL Server #CREATE TABLE [SalesLT].[SalesRep]( # [SalesRepID] [int] NOT NULL, # [FirstName] [nvarchar](25) NULL, # [LastName] [nvarchar](25) NULL) ## Connection parameters ## sqlserver = '#SERVERNAME#.database.windows.net' port = '1433' database = 'AdventureWorksLT' user = '#USERNAME#' pswd = '#PASSWORD#' # Create DataFrame to insert into the Azure SQL table df1 = spark.createDataFrame([(1, 'Luke', 'Skywalker')], ['SalesRepId', 'FirstName', 'LastName']) # Insert the rows into the Azure SQL table df1.write \ .option('user', user) \ .option('password', pswd) \ .jdbc('jdbc:sqlserver://' + sqlserver + ':' + port + ';database=' + database, 'SalesLT.SalesRep', mode = 'append' ) # Read the table into Spark and show the results spark.read \ .option('user', user) \ .option('password', pswd) \ .jdbc('jdbc:sqlserver://' + sqlserver + ':' + port + ';database=' + database, 'SalesLT.SalesRep') \ .show()
Notice the use of "mode = append". This allows us to insert data into the table without messing with any of the existing data int the table. This means that if you change the contents of df1 to something like (2, 'Han', 'Solo') and re-execute the code, you should see 2 rows in the resulting table. If you change the mode to "overwrite", then the table will be dropped and recreated with only the new rows.
A final interesting note on inserting into an Azure SQL DW table from a Spark DataFrame is that if the Azure SQL DW table doesn't already exist, Spark will create the table for you using the schema from the DataFrame. This could be helpful in scenarios where you are in development and want to quickly get the DDL created for data from a DataFrame or if you are loading a staging database where staging tables get dropped each day after processing to save on storage costs in Azure.
Reference SQL DW Table as Spark SQL Table
Finally, we'll look at how you can reference tables in SQL DW as SparkSQL tables or views. Spark supports writing SQL-like queries using HiveQL. The examples below show how you can create a Spark SQL table that references an Azure SQL DW table using the JDBC connection. This allows you to query the table, insert data into the table, and even join the table with other Hive or Spark SQL tables
The first example simply defines a Spark SQL table from an Azure SQL DW table using the JDBC connection. You can see that once the table is defined, it can be queried using a simple HiveQL query. Keep in mind that the data for the table is not actually being stored in Spark SQL/Hive, but instead it's being stored in Azure SQL DW and just referenced through Spark SQL.
%sql -- Create Table CREATE TABLE SalesRep USING org.apache.spark.sql.jdbc OPTIONS ( url 'jdbc:sqlserver://#SERVERNAME#.database.windows.net:1433;database=AdventureWorksLT', dbtable 'SalesLT.SalesRep', user '#USERNAME#', password "#PASSWORD#" ); SELECT * FROM SalesRep;
After executing this code block in an Azure Databricks notebook, you should see the rows that were inserted from the previous section in the blog.
The next example is a simple INSERT into the SalesRep table. Again, remember that the row of data is actually getting inserted into the Azure SQL DW table (you can validate this by opening up SQL Management Studio and querying the table before and after executing the code block below).
%sql INSERT INTO TABLE SalesRep VALUES(3, 'Princess', 'Leia'); SELECT * FROM SalesRep;
This example will append the row to the existing data in the table. If instead you use "INSERT OVERWRITE TABLE", the existing table in Azure SQL DW (SalesLT.SalesRep) would have been dropped and recreated with the new data.
Finally, keep in mind that you can INSERT from a SELECT statement as well. This would allow you to select and transform data from potentially multiple tables in Spark SQL/Hive and insert the results into an Azure SQL DW.
The use cases and examples above definitely don't cover all the possible ways to use Azure SQL DW or Azure SQL DB with Azure Databricks or Spark on HDInsight. Hopefully this post has given you an intro to how these 2 services can be used together, and "Spark"ed (pun intended...) some ideas for additional use cases and more advanced techniques.
The code from this post can be found here. It includes an Azure Databricks notebook exported in 2 different forms: Databricks archive (.dbc) and an HTML file. Either of these can be imported directly into Azure Databricks or the HTML file can just be saved and viewed on its own.
Keep an eye out for more Azure Databricks content coming soon!