When you look at a diagram like the one above, it can often be intimidating. There seem to be so many services in the Azure Data Platform-as-a-Service offering (aka Cortana Intelligence Suite). What do they all do? Do I need to use them all? Where do I even start with my cloud data and analytics solution? While I won't try to answer all of these questions in this post, I do hope to provide an easier way of thinking about the Cortana Intelligence services.
When building data and analytics solutions in Azure, and when discussing those solutions with clients, I like to break most of the services down into their most basic functions: storage and compute. Whether dealing with big data or small data, schematized/structured data or unstructured data, streaming data or batched data, nearly any analytics pipeline consists of a series of components and steps that are simply storage or compute.
A typical, simple flow consists of one or more compute steps executed against data residing in some form of storage. The result of each compute step is typically new data that lands back in some form of storage.
Once you understand this very basic principle, working with data services in Azure becomes much easier. Think of the services as a collection of storage and compute building blocks or Legos that you can piece together in a number of different ways to ultimately create intelligence from raw data.
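The storage-and-compute pattern above can be sketched in a few lines of plain Python. This is purely illustrative (the dicts, file names, and function are made up for this post, not Azure APIs): "storage" is a dict, and a compute step is a function that reads from one store and writes its result into another.

```python
# Minimal sketch of the storage -> compute -> storage pattern.
# "Storage" is just a dict here; in Azure it might be Blob Storage,
# Data Lake Store, or SQL DW. All names are illustrative only.

raw_storage = {"events.csv": "page_view,3\npage_view,5\npurchase,2"}
processed_storage = {}

def aggregate_events(blob_text):
    """A compute step: sum counts per event type."""
    totals = {}
    for line in blob_text.splitlines():
        event, count = line.split(",")
        totals[event] = totals.get(event, 0) + int(count)
    return totals

# Compute runs against data in storage; the result lands in storage again.
processed_storage["event_totals"] = aggregate_events(raw_storage["events.csv"])

print(processed_storage["event_totals"])  # {'page_view': 8, 'purchase': 2}
```

Swapping the dicts for real storage services and the function for a Hive script, U-SQL job, or stored procedure gives you the same shape at cloud scale.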
The graphic above splits most of the services in the Cortana Intelligence Suite into either storage or compute. You'll note that some services show up in both the storage and compute categories. Services that are not included in the graphic are not considered storage or compute services. These services might be used for scheduling and orchestrating pipelines (Data Factory), publishing/exposing data assets to users (Data Catalog), or visualizing data (Power BI).
The Azure services outlined below provide different storage options for analytic workloads.
Blob Storage
Blob Storage is one of the simplest and cheapest forms of storage in Azure. It is great for storing files of any type and almost any size. Nearly all of the compute services in Azure can read from and write to Blob storage. This makes it a great option for storing/landing raw source data or being an input/output to nearly any compute step in a pipeline.
Data Lake Store
Data Lake Store is a Hadoop Distributed File System (HDFS)-compatible store in the cloud with no limits on file size, optimized for massive throughput and parallel processing. It integrates well with big data compute services like Data Lake Analytics and HDInsight, and is ideal for both streaming and batch workloads as well as structured or unstructured data. Use Data Lake Store to store raw source data for future querying and for storing processed outputs of Data Lake Analytics and Hive or Spark jobs. These processed outputs can then be stored and/or used in other compute services like Machine Learning, or they can be processed into more schematized storage services like SQL DW.
SQL Data Warehouse (SQL DW)
SQL Data Warehouse is a massively parallel SQL Server database with petabyte scale in Azure and the ability to be paused when not in use (great cost-savings feature!). It includes many SQL Server storage features including partitioning, indexing, and in-memory storage capabilities. SQL DW is a great choice for storing big, structured, production-ready analytics data that either needs to be processed further using SQL stored procedures or Azure Machine Learning, or that is ready to be queried by visualization and reporting tools like Power BI.
Other viable storage services in Azure include DocumentDB and SQL Database. DocumentDB is a NoSQL database for JSON documents, similar to MongoDB. SQL Database is a managed SQL database similar to SQL Data Warehouse, but without some of the enterprise storage features and scalability.
The Azure services outlined below provide different compute options for analytic workloads.
HDInsight
HDInsight provides managed Apache Hadoop and Spark clusters in the cloud. One of the simplest but most useful use cases for HDInsight is running Hive (a SQL-like language) scripts against data stored in either Blob Storage or Data Lake Store. This processing can also be very cost-effective if you create clusters only when you need them, and delete them (but leave all the processed data!) when you are done.
Data Lake Analytics
Data Lake Analytics is a compute service intended to work with Data Lake Store. It uses a VERY Hive-like programming language called U-SQL to query and process data. You can also easily scale processing power up or down to meet your speed vs. cost requirements. Unlike HDInsight (which charges you as long as the cluster is active, even if it isn't doing anything), Data Lake Analytics only charges you while a U-SQL job is actually running.
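The billing difference is easy to see with some back-of-the-envelope arithmetic. The rates below are hypothetical placeholders (not actual Azure pricing), just to show how billed-while-idle compares to billed-per-job:

```python
# Hypothetical rates -- for illustration only, not real Azure prices.
CLUSTER_RATE_PER_HOUR = 4.00   # HDInsight-style: billed while the cluster is up
JOB_RATE_PER_AU_HOUR = 2.00    # Data Lake Analytics-style: billed per job

# A cluster left running 24 hours to serve one 2-hour job is billed for all 24.
hdinsight_cost = 24 * CLUSTER_RATE_PER_HOUR   # 96.0

# The same 2-hour job with 1 analytics unit (AU) is billed only for its runtime.
dla_cost = 2 * 1 * JOB_RATE_PER_AU_HOUR       # 4.0

print(hdinsight_cost, dla_cost)  # 96.0 4.0
```

The gap closes, of course, if you diligently delete HDInsight clusters when jobs finish, which is exactly the practice suggested above.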
Azure Machine Learning (Azure ML)
Azure Machine Learning (Azure ML) is a compute service for easily building and deploying predictive analytics solutions. Azure ML is capable of reading and writing data to and from most of the Azure storage services, and has both a batch and request-response execution mode. This makes it a logical option for almost any predictive or statistical learning compute step.
SQL Data Warehouse (SQL DW)
SQL DW typically provides compute services in the form of stored procedures. Compute resources can be scaled up or down depending on speed vs. cost requirements. SQL DW can also query big data stores (Blob, HDFS, etc.) directly and join data from those big data stores with data housed in the service's local SQL database. This feature (PolyBase) opens the door for easier integration/processing of big data into SQL databases.
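The PolyBase idea, joining data in external big data storage with local relational tables, can be sketched in miniature. Here an in-memory CSV string stands in for a file in Blob/HDFS and SQLite stands in for SQL DW; none of this is actual PolyBase syntax, just the concept:

```python
import csv, io, sqlite3

# A CSV "in blob storage" (an in-memory string stands in for the external file).
blob_csv = "customer_id,clicks\n1,42\n2,7\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Expose the "external" data as a table so it can be joined like any other.
conn.execute("CREATE TABLE ext_clicks (customer_id INTEGER, clicks INTEGER)")
rows = [(int(r["customer_id"]), int(r["clicks"]))
        for r in csv.DictReader(io.StringIO(blob_csv))]
conn.executemany("INSERT INTO ext_clicks VALUES (?, ?)", rows)

# One SQL query joins local relational data with the external data.
result = conn.execute(
    "SELECT c.name, e.clicks FROM customers c "
    "JOIN ext_clicks e ON c.id = e.customer_id ORDER BY c.id"
).fetchall()
print(result)  # [('Ada', 42), ('Grace', 7)]
```

In SQL DW the external file stays where it is (PolyBase defines an external table over it) rather than being copied in, but the payoff is the same: one SQL statement spanning both worlds.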
Stream Analytics
Stream Analytics is a compute service used for aggregating and processing event data as it arrives. The service is ideal for performing near-real-time analytics on IoT (Internet of Things) data. While the service can push streaming data into most Azure storage options, Stream Analytics has a very specific use case and is therefore not as easily interchangeable as some of the other compute services.
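The kind of aggregation Stream Analytics performs can be sketched in plain Python as a tumbling-window count over timestamped events. This is illustrative only (the real service runs a SQL-like query language over live event streams, not Python over a list):

```python
from collections import defaultdict

# Timestamped events as they might arrive from IoT devices: (seconds, sensor).
events = [(0, "door"), (2, "door"), (7, "window"), (11, "door"), (14, "window")]

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping (tumbling) time window."""
    counts = defaultdict(int)
    for ts, _sensor in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# 5-second tumbling windows: [0,5) has 2 events, [5,10) has 1, [10,15) has 2.
print(tumbling_window_counts(events, 5))  # {0: 2, 5: 1, 10: 2}
```

In Stream Analytics the equivalent is a GROUP BY over a tumbling window in its query language, with results pushed continuously into a storage service or Power BI.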
Remember that the services in the Cortana Intelligence Suite can simply be viewed as a collection of storage and compute building blocks. While there may often be multiple combinations of blocks that yield the same final output, the key is choosing the storage and compute services that deliver the insights you need both efficiently and cost-effectively.