9 May
2018

Data, data, data. //build/ Day 3

Category:Azure

So many of us think of Azure as a great storage platform, and it is, with many offerings tailored to meet your specific needs. Even so, there are also many offerings that go beyond storage and can completely revamp your ELT/ETL scenarios.

The technologies mentioned here all aim at what Microsoft has helped developers do from the start: focus on business logic while the platform complexities are abstracted away.

Most of the sessions I attended today demonstrated different technologies for moving and manipulating data. There is great value in not having to write data-transformation code yet again. There is value in having data stream time windows handled properly (not an easy code-it-yourself problem). Capabilities like these are enabling all kinds of connected scenarios. With that in mind, what are some of these technologies?

Data Factory

There are so many issues that come up when bringing data into your organization or moving it from one data source to another. How do we reconcile various formats, different representations of the same data, schema mismatches, pipeline monitoring, and so much more?

Getting data from one place to another can be a real pain in the @ss. With various formats and DBs involved, the complexity increases non-linearly as we add components to the overall system. Reasons for all this pain can include:

  1. DB or data schema mismatch
  2. Different representations of the same data from two data points
  3. Corrupt files and data
  4. Scalability
  5. Schema evolution
  6. And so on

This is where Azure Data Factory comes in. Version 2 of Data Factory is now in preview and offers some amazing capabilities.

Data Factory is a fully managed service for ingesting, processing, and monitoring data movement in what it calls a pipeline. Data Factory helps across the entire data management pipeline:

Connect & Collect > Transform & Enrich > Publish > Monitor

A single Data Factory can handle more than one pipeline at a time. One reason to do this might be to ingest data into a data store and then kick off another pipeline that operates on it, preparing it for ML processing. Finally, a third pipeline for the Data Factory may submit the job to Spark for that machine learning work, the results of which can go into another storage medium for later analysis.

Pipelines are composed of activities, which take in and output datasets. And activities are plentiful! What do you want to do with your data? Where do you want to get it? Where do you want to write it? There’s likely an activity for that. Stringing these activities together into a pipeline is the heart of Data Factory, and you can compose the workflow visually right in the Azure Portal. Transforming data from one format to another is another valuable type of activity. I’ll stop there on activities because there are too many to cover; see here for a complete listing.
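To make that more concrete, here is a minimal sketch of a pipeline holding a single Copy activity, written as a Python dict that mirrors the JSON structure Data Factory v2 uses. The pipeline and dataset names are hypothetical, and a real factory would also define the referenced datasets and their linked services.

# Minimal sketch of a Data Factory v2 pipeline: one Copy activity that moves
# data from a blob dataset to a SQL dataset. "BlobInputDataset" and
# "SqlOutputDataset" are hypothetical placeholders defined elsewhere in the factory.
copy_pipeline = {
    "name": "CopyBlobToSqlPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyRawFiles",
                "type": "Copy",
                "inputs": [
                    {"referenceName": "BlobInputDataset", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "SqlOutputDataset", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"}
                }
            }
        ]
    }
}

A definition like this can be authored visually in the portal or deployed through the REST API or SDKs; the visual designer is essentially building up this same structure for you.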

Alerting on anomalies in the data is also supported, so you can see where things just don’t fit. Data that doesn’t fit can be shelved for later inspection.

Databricks

Okay, great! I have my data in Azure. Maybe it has touched disk in a DB, or maybe not, but I want to submit it for some downstream processing.

Enter Azure Databricks. Databricks is optimized for use with Spark and integrates with all Azure data sources. To give you an idea of where Databricks is headed, it is informally being called a “Unified Analytics Platform.” It is a fully managed solution incorporating Spark to transform the data along the way. Most Spark modules (including ML) are supported.

The whole model works from shared Notebooks, which allow scheduled run instances and collaboration among everyone involved in a data project.

When the ingestion or transformation load gets too high for the current cluster of compute resources, it does what things on Azure do: it scales. Elasticity for the win. Even better, it scales back down, saving money on unneeded compute nodes.
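As a rough sketch of what that elasticity looks like in practice, a cluster definition can declare an autoscale range instead of a fixed worker count; the runtime version, VM size, and worker counts below are placeholder assumptions, not recommendations.

# Sketch of an autoscaling Databricks cluster definition (Clusters API style).
# The Spark version, node type, and worker counts are placeholders.
autoscaling_cluster = {
    "cluster_name": "etl-autoscale-demo",
    "spark_version": "4.0.x-scala2.11",   # placeholder Databricks runtime
    "node_type_id": "Standard_DS3_v2",    # placeholder Azure VM size
    "autoscale": {
        "min_workers": 2,                 # scale down to this when idle
        "max_workers": 8                  # scale up to this under load
    }
}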

Azure Databricks integrates deeply with Azure databases and stores, including SQL Data Warehouse, Cosmos DB, Data Lake Store, and Blob Storage. It also integrates seamlessly with Power BI so you can gain insights about the data in flight or at rest.
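To give a feel for the notebook model, here is a minimal PySpark cell sketch: read CSV files from Blob Storage, aggregate, and write the result back out as Parquet. The storage account, container, and column names are hypothetical, and the cell assumes credentials for the storage account are already configured on the cluster.

# Minimal sketch of a Databricks notebook cell (PySpark). In a notebook,
# `spark` is the pre-created SparkSession. Storage account, container, and
# column names are hypothetical; storage credentials are assumed to be
# configured on the cluster already.
input_path = "wasbs://raw@mystorageaccount.blob.core.windows.net/sales/*.csv"
output_path = "wasbs://curated@mystorageaccount.blob.core.windows.net/sales-by-region/"

sales = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv(input_path))

# A simple transform-and-enrich step: total sales amount per region.
sales_by_region = (sales
                   .groupBy("region")
                   .sum("amount")
                   .withColumnRenamed("sum(amount)", "total_amount"))

sales_by_region.write.mode("overwrite").parquet(output_path)

From there, the same result could just as easily be written to SQL Data Warehouse or picked up by Power BI for reporting.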

Database Migration Service

The Database Migration Service is designed to migrate relational databases from various products to a destination in Azure. It can read from relational DB sources like SQL Server, Oracle, PostgreSQL, etc.

The tool has a very helpful feature showing what is in your source DB that may not be supported in your destination DB. An example shown was an on-premises SQL Server that contained a stored procedure which made a cross-DB join. Of course, this could be a show stopper if you aren’t migrating both DBs, but it is a simple example of the type of thing the migration inspection can find. Cool. One hint they mentioned is that SQL DB Managed Instance will often have fewer migration issues than SQL Azure.

The data transfer happens by creating a DB backup, uploading it, and restoring it in Azure. I can, however, set up the transfer job to keep using the on-premises DB until the Azure DB is in place. Not to mention that I don’t have to change the connection strings in my applications. Also, I can migrate just my schema without the data.

As I suspected, a hybrid network configuration is required to use this service to get the database from on-premises to Azure.

One customer story described a Microsoft customer who moved over 500 DBs to Azure before they were billed for their annual datacenter contract. This was a huge cost-savings move, and they did it in under two months.

Conclusion

I attended several sessions today focused on data and found that we have a developing story here. Our data storage solutions are solid and getting stronger. Our data management services seem to be just a little behind, and this makes perfect sense. As the DBs themselves settle into their own, the services around them will undoubtedly coalesce into single-source-of-truth tools for given scenarios. As of the time of this post, it can be a bit difficult to choose the right data service for your scenario. I am confident that will clear up in time.

The tools do exist, however, and they are solving real-world problems. If you have read this far, dig into something:

  1. Data Factory
  2. Databricks
  3. Database Migration Service