More and more data scientists are using cloud database services for a host of different reasons:
- Ease of provisioning
What the advantages boil down to for a data scientist is simple: they no longer have to spend time with the dreary mechanics of setting up all the infrastructure, and can instead focus immediately and directly on collecting and analyzing data.
Cloud computing has changed every aspect of information technology, and data science is no exception. The advantages of a multi-tenant architecture have led to on-demand scalability, low-cost performance computing, and easy access from anywhere in the world.
With cloud computing coming to dominate IT services, it’s a sure bet that data scientists will find themselves working with one or more cloud database services.
But there is still one important decision they have to make before getting down to the crunch: which cloud database service to use?
Choose Your Weapon: Cloud-Based Database Options
Almost any type of database can be found as a cloud-based service today, but there are only a handful of popular providers. Most of them offer more than one type of database, allowing data scientists to choose the best solution for the task at hand.
Generally, cloud-based databases can be grouped into two camps:
- Those in which the database is offered as a managed service: users simply make calls directly into a shared store, and the provider takes care of security and storage.
- Those where the database is offered as a complete instanced version of a popular database engine, akin to a virtual server running Oracle or MySQL, for instance.
The latter type of service offers finer-grained control over the database setup and can be combined with other computing services chosen by the user. If a data scientist wanted to run an R program on AWS, for example, the only option would be to set up an Amazon Machine Image (essentially, a preconfigured virtual server) with RStudio installed and connect it to the data source manually.
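Connecting manually to an instanced engine usually starts with a connection string. The following is a minimal sketch, assuming the common `engine://user:password@host:port/database` URL layout that many drivers and toolkits accept; the host name, credentials, and the helper `build_connection_url` are hypothetical and exist only for illustration.

```python
from urllib.parse import quote


def build_connection_url(engine, user, password, host, port, database):
    """Assemble a URL of the form engine://user:password@host:port/database."""
    # Percent-encode the credentials so that characters such as '@' or '/'
    # cannot be mistaken for URL delimiters.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"{engine}://{safe_user}:{safe_password}@{host}:{port}/{database}"


url = build_connection_url(
    engine="postgresql",
    user="analyst",              # hypothetical account
    password="p@ss/word",        # hypothetical password with awkward characters
    host="db.example.com",       # hypothetical endpoint of the instanced database
    port=5432,
    database="research",
)
print(url)
```

With a managed service, this step is typically replaced by credentials issued through the provider's console, which is part of what makes provisioning faster.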
Managed services remove some of the ability to directly control what is happening in the database engine, but they also take away the headache of configuration and maintenance of the database instance. They are faster to provision and scale, requiring only a credit card to set up.
Almost all cloud databases offer web-based provisioning and management consoles, allowing easy setup and configuration changes. Additionally, they provide APIs to allow direct calls to the underlying data from external programs or services. Finally, most provide some sort of backup or high-availability service, abstracting the mechanics of configuring redundant machines and executing backup jobs so that users do not have to think about them.
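For relational engines, calls from an external program usually go through a database driver. The sketch below uses Python's built-in `sqlite3` module with an in-memory database purely as a local stand-in, so it runs without any cloud account; the table `readings` and the helper `average_by_group` are hypothetical. Against a hosted engine, the `connect` call would point at the provider's endpoint through that engine's own driver, and the placeholder style in the query (`?` here) may differ by driver.

```python
import sqlite3


def average_by_group(conn, min_value):
    """Return (group, average value) pairs for readings at or above min_value."""
    cur = conn.cursor()
    # Pass values as parameters instead of formatting them into the SQL string.
    cur.execute(
        "SELECT grp, AVG(value) FROM readings "
        "WHERE value >= ? GROUP BY grp ORDER BY grp",
        (min_value,),
    )
    return cur.fetchall()


# Local stand-in for a remote database: an in-memory SQLite instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (grp TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 0.5)],
)
conn.commit()

rows = average_by_group(conn, 1.0)
print(rows)
conn.close()
```

Keeping the query logic in a function that only receives a connection object means the analysis code does not need to know where the database is hosted.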
Amazon Web Services
Amazon kicked off the whole cloud-based services business and remains the largest provider, with the most extensive library of services and the broadest geographic distribution of computing clusters. Correspondingly, Amazon offers the greatest variety of cloud databases and the most possible integrations with other useful services. AWS pricing also tends to be lower than competitors' pricing, depending on specific configurations and services.
- Relational Database Service (RDS) offers a variety of popular RDBMS engines including Aurora, PostgreSQL, MySQL, Oracle, and SQL Server
- DynamoDB is a managed NoSQL database
- ElastiCache is a fast in-memory cache service with a choice of two caching engines: Memcached or the NoSQL key-value store Redis
- Redshift is a data warehousing service that integrates with the other Amazon database offerings to store up to petabyte-scale volumes of data
- Finally, through Amazon’s Elastic Compute Cloud (EC2) and Elastic Block Store (EBS), customers can install a wide variety of virtual machine instances that come preconfigured with various databases, or even their own custom database engine
Microsoft Azure
It shouldn’t be a big surprise that Microsoft’s cloud database offering is centered on its popular enterprise RDBMS, SQL Server. But the company has also branched out with offerings to accommodate users who want faster, non-relational database solutions. A unique advantage of Azure is the ability to “stretch” existing single-tenant SQL Server stores into the cloud for backup or analysis. Azure is a good choice for data scientists who are familiar with, or need to integrate with, other Microsoft tools and services.
- SQL Database is a cloud-based implementation of the standard SQL Server engine, with all the bells and whistles that RDBMS is known for, including Active Directory and Microsoft System Center integration
- DocumentDB is a NoSQL document store database
- Redis Cache is an implementation of the popular NoSQL key-value store
- SQL Data Warehouse offers warehousing for large data volumes at cost-effective prices
Google Cloud Platform
Despite a major investment in scalable, general-purpose computing power, Google got into cloud service offerings quite a lot later than Amazon, and it shows. A comparatively paltry selection of services, less geographic distribution, and a less sophisticated support and provisioning system all make AppEngine seem primitive in comparison with both Azure and AWS.
On the other hand, Google offers a certain amount of usage completely free, and their accounting is more finely grained than any competitor, making it possible to execute certain jobs less expensively. Python, the primary supported language for AppEngine, is also already popular with many data scientists.
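The effect of billing granularity on short jobs can be illustrated with a little arithmetic. This is a minimal sketch: the rate and the billing increments below are hypothetical numbers chosen for illustration, not any provider's actual prices, and `billed_cost` is an illustrative helper.

```python
import math


def billed_cost(runtime_seconds, rate_per_hour, increment_seconds):
    """Cost of one job when usage is rounded up to whole billing increments."""
    increments = math.ceil(runtime_seconds / increment_seconds)
    billed_seconds = increments * increment_seconds
    return rate_per_hour * billed_seconds / 3600


RATE = 0.60          # hypothetical price per instance-hour
JOB_SECONDS = 600    # a ten-minute analysis job

hourly = billed_cost(JOB_SECONDS, RATE, 3600)    # billed in whole hours
per_minute = billed_cost(JOB_SECONDS, RATE, 60)  # billed in whole minutes

print(f"billed by the hour:   ${hourly:.2f}")
print(f"billed by the minute: ${per_minute:.2f}")
```

Under these assumptions, the same ten-minute job is charged for a full hour in the first case and for only ten minutes in the second, which is why finer-grained accounting favors short, bursty workloads.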
- Cloud Bigtable makes the same powerful NoSQL datastore behind many Google services (including Search, Gmail, and Maps) available directly to customers
- Cloud Datastore is another NoSQL data store
- Cloud SQL is a managed RDBMS using the MySQL engine
Heroku
Heroku was one of the first cloud platforms, initially designed as a hosting service for applications written in the popular Ruby programming language. Since then, it has expanded its horizons and been purchased by Salesforce, another major cloud player.
Uniquely, Heroku supports add-on elements from third-party developers, which expand its database offerings to more than a dozen different engines.
- PostgreSQL is a common open-source SQL database provided as a managed service
- Redis is an implementation of the popular NoSQL key-value store
- Add-on supported database engines on Heroku include:
Other Honorable Mentions
There are as many types of cloud-database providers as there are database engines. Specialized services for various niche markets spring up constantly, offering a slight edge in some aspect or another of hosting service when compared to their larger competitors. These smaller vendors may be more nimble and more willing to customize their platform for particular clients, but they are also more likely to go out of business or be snapped up by larger companies, throwing their stability into question.
Cloudant – Cloudant is IBM’s implementation of the CouchDB NoSQL database engine, offering some useful proprietary extensions of that open-source project.
Rackspace – A large hosting provider that has transitioned to cloud services, Rackspace offers MySQL RDBMS in virtual or managed configurations, as well as a rebranded NoSQL offering from Cloudant.
SAP – A big player in enterprise software, SAP now offers a platform called HANA which provides some of the database features of that software via the cloud.
mLab – mLab provides dedicated cloud hosting of the popular NoSQL MongoDB database engine.
EnterpriseDB – EnterpriseDB provides PostgreSQL and Oracle RDBMS in a business-focused managed cloud environment while offering consulting and other ancillary services neglected by other providers.
An Example of How Cloud Databases Have Revolutionized How We Access Data
In 2010, online streaming video provider Netflix exploded in popularity. The service watched its daily requests expand 37-fold, putting unprecedented strain on its in-house servers. The company was literally unable to build new data centers fast enough to keep up with the expansion.
So Netflix turned to a company that had already gone through all those growing pains to host its database services: Amazon.
Shifting from in-house Oracle databases to Amazon Web Services (AWS)-hosted Cassandra NoSQL databases gave Netflix the ability to scale to over a million write operations per second. Moreover, with Amazon minding all the hardware, Netflix engineers were free to concentrate on their core business of streaming digital video.
After a seven-year migration process, Netflix finally finished moving every aspect of its data processing operation over to AWS in 2016, and the company isn’t looking back.