Decoding
Snowpark And Databricks
A Deep Dive for Data Practitioners
Blog Series Introduction -
In today's rapidly evolving world of data science, the right tools and technologies can often mean the difference between hours of frustration and smooth sailing towards insightful discoveries. As data scientists navigate the intricate landscape of data manipulation, analysis, and model building, they rely heavily on the robust support provided by their chosen tools. In this blog series, we compare two such technologies that have been instrumental in shaping the data science landscape. We will explore the distinctive features of both options and weigh their advantages, ultimately uncovering which one holds the potential to make a data scientist's complex tasks not only manageable but even enjoyable. So, whether you're a seasoned data professional or just stepping into this captivating field, join us on this exploration to discover the technology that might just save you hours of work!
To empower our clients in making informed choices, we delve deep into the realm of various platforms, meticulously analyzing and providing optimal solutions. In this pursuit, we embark on a thorough exploration of Snowpark and undertake a comprehensive comparison with Databricks through a series of enlightening blog posts. By unraveling the intricacies of these platforms, we aim to equip our clients with the knowledge and insights necessary to make the right decisions for their specific needs and aspirations.
Databricks has supported Python in production for many years and embraces the open-source ecosystem, including Spark and many data science and machine learning libraries.
Snowpark is a new proprietary feature in Snowflake that provides a set of libraries, including DataFrame APIs and native Snowpark machine learning APIs. It requires you to adapt your code to Snowflake's platform, which is the only place it can run, and it is still missing many key features, slowing down development.
As part of the comparison between Databricks and Snowpark, we are presenting a three-part blog series: first comparing the machine learning experience of Databricks with Snowpark, and then comparing data engineering with Delta Live Tables (DLT) against Snowflake's Snowpark when developing and deploying an end-to-end ETL pipeline. The focus is on using Python as much as possible, falling back to SQL, Java, CLIs, or external compute only if needed.
Databricks vs Snowpark for Data Science Practitioners:
We aim to compare the use of Databricks Notebooks with Snowpark Python worksheets (in Snowsight) for machine learning use cases. We wanted to compare the ML APIs and runtimes and explore the various phases of the ML lifecycle, from feature engineering to model deployment (MLOps). In the following sections, we describe the steps we followed and share our analysis and code so that others can replicate it.
· Warm Up: Notebooks vs Worksheets, including plotting
· Full Sprint: ML Runtime vs Snowpark Conda/Snowpark ML
Chapter 1: Warm Up: Notebooks vs Worksheets including Plotting -
Databricks steps:
Our team consists of experts in Spark and Python who tend to minimize their use of SQL and Java whenever possible, so we will use Databricks Notebooks for exploratory data analysis.
In Databricks, we begin by creating a new notebook. We then have the option to upload data to DBFS.
We can then upload any local file to a target directory of our choice. Databricks then provides a useful copy button that generates the code to create a Spark DataFrame with the correct file format and the path to the file.
We just added one line, .option('inferSchema', 'true'), to infer the correct column data types.
You can then paste the code into the notebook and run it to view the data.
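For reference, here is a minimal sketch of what the pasted snippet looks like; the file name and DBFS path below are placeholders for whatever you uploaded.

```python
# Sketch only: substitute the path shown by the upload dialog's copy button.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")  # the one line we added to infer column types
    .load("/FileStore/tables/global_factory_data.csv")
)

display(df)  # renders the DataFrame as an interactive table in the notebook
```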
To understand the data, you can run the built-in data profiler.
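The profile is available from the cell's results pane, and it can also be triggered programmatically via the dbutils data utility; a minimal sketch, assuming the df created above:

```python
# Sketch: generate summary statistics and a profile for every column of df.
dbutils.data.summarize(df)
```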
Along with this, we can use any Python plotting library to further explore the data.
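For example, a quick seaborn plot over a pandas conversion of the DataFrame; the column name below is hypothetical and should be replaced with one from your dataset.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Convert the (reasonably small) Spark DataFrame to pandas for plotting.
pdf = df.toPandas()

# Hypothetical column name: replace "temperature" with a column from your data.
sns.histplot(data=pdf, x="temperature")
plt.title("Distribution of temperature readings")
plt.show()
```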
Snowpark steps -
In parallel, we attempt to do the same in Snowflake using the new Snowpark Python worksheets.
To get started with uploading the data, we first have to create a schema and a table.
https://docs.snowflake.com/en/user-guide/data-load-web-ui#loading-data-using-snowsight
This can be done in Python, but all the DDL still requires SQL. The following code creates the GLOBAL_FACTORY_DEV database:
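A minimal sketch of that first attempt; the function and schema names are our own placeholders, and the exact code in the original worksheet may differ.

```python
import snowflake.snowpark as snowpark

# First attempt: a plain function containing the DDL, run from a Python worksheet.
def create_database(session: snowpark.Session):
    session.sql("CREATE DATABASE IF NOT EXISTS GLOBAL_FACTORY_DEV").collect()
    session.sql("CREATE SCHEMA IF NOT EXISTS GLOBAL_FACTORY_DEV.FACTORY_SCHEMA").collect()
```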
Trying to run the above generates an error telling us to use a 'handler' function.
We added the name of the function to run as the 'handler' in the worksheet settings:
This works, but it does not return any useful information under Results or Output.
We try returning the result as a string.
This now works and shows the schema was created:
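Roughly, the working version looks like this, with the worksheet's handler setting pointing at the function and the function returning a string so something shows up under Results (names are still our placeholders):

```python
import snowflake.snowpark as snowpark

# Worksheet handler setting: create_database
def create_database(session: snowpark.Session):
    session.sql("CREATE DATABASE IF NOT EXISTS GLOBAL_FACTORY_DEV").collect()
    session.sql("CREATE SCHEMA IF NOT EXISTS GLOBAL_FACTORY_DEV.FACTORY_SCHEMA").collect()
    # Returning a string makes the outcome visible in the Results pane.
    return "Created schema GLOBAL_FACTORY_DEV.FACTORY_SCHEMA"
```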
For the CREATE TABLE DDL, we have to open the CSV file locally and work out the data types ourselves, which is time-consuming.
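A sketch of what that DDL ends up looking like; the table name, column names, and types below are hypothetical guesses based on a typical CSV, not the actual dataset.

```python
import snowflake.snowpark as snowpark

# Worksheet handler setting: create_table
def create_table(session: snowpark.Session):
    session.sql("""
        CREATE TABLE IF NOT EXISTS GLOBAL_FACTORY_DEV.FACTORY_SCHEMA.FACTORY_DATA (
            FACTORY_ID    NUMBER,
            READING_TIME  TIMESTAMP_NTZ,
            TEMPERATURE   FLOAT,
            STATUS        VARCHAR
        )
    """).collect()
    return "Created table FACTORY_DATA"
```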
The table is created:
We can now follow the instructions to upload the CSV dataset.
We change these three options and leave the rest at their defaults:
Clicking on Query Data only gives a SQL example, showing that Snowpark and Python are not prominent features in Snowflake. We created a function to load the table and preview the data.
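A sketch of that helper; the fully qualified table name is the hypothetical one used above. Returning the Snowpark DataFrame from the handler renders it as a table under Results.

```python
import snowflake.snowpark as snowpark

# Worksheet handler setting: preview_table
def preview_table(session: snowpark.Session):
    df = session.table("GLOBAL_FACTORY_DEV.FACTORY_SCHEMA.FACTORY_DATA")
    df.show(10)          # prints the first rows to the worksheet's Output tab
    return df.limit(10)  # a returned DataFrame is displayed under Results
```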
Before trying to use any Python packages (from the Anaconda channel), one must accept the terms.
After that, we can try to generate plots similar to those we created in the Databricks notebook.
We try to use a Python plotting library to see if it can output the results in a format we would like to see.
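A sketch of the attempt, assuming the hypothetical table and column names from earlier; as far as we can tell, Python worksheets have no inline chart rendering, so the best we can do is build the figure and return a confirmation string.

```python
import matplotlib.pyplot as plt
import seaborn as sns
import snowflake.snowpark as snowpark

# Worksheet handler setting: plot_data
def plot_data(session: snowpark.Session):
    # Pull the table into pandas and build a histogram of a (hypothetical) column.
    pdf = session.table("GLOBAL_FACTORY_DEV.FACTORY_SCHEMA.FACTORY_DATA").to_pandas()
    sns.histplot(data=pdf, x="TEMPERATURE")
    plt.title("Distribution of temperature readings")
    return "histogram built"
```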
It looks like seaborn is not available by default, as it is in Databricks.
It looks like we can install the seaborn library, along with the matplotlib libraries.
It looks like they are now installed, although there was no indicator.
We run the plot function again and it fails.
It seems the conversion from Snowflake data types to pandas data types is not working correctly. We try to print out the dtypes to investigate the issue.
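A sketch of the dtype check, returned as a string so it is visible under Results (the table name, as before, is hypothetical):

```python
import snowflake.snowpark as snowpark

# Worksheet handler setting: check_dtypes
def check_dtypes(session: snowpark.Session):
    pdf = session.table("GLOBAL_FACTORY_DEV.FACTORY_SCHEMA.FACTORY_DATA").to_pandas()
    return str(pdf.dtypes)
```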
The pandas DataFrame's data types are correct, though capitalized, so it's not clear what the issue is; at this point we stop and just use Databricks to visualize the data easily.
When running the same code after setting up a development environment, it works. However, the pandas conversion does not rely on the pandas package alone; it requires the additional installation of snowflake-snowpark-python[pandas].
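For completeness, a minimal sketch of such a setup, assuming a local Python environment; the connection parameters are placeholders.

```python
# Local development environment. Note the extras: a plain
# "pip install snowflake-snowpark-python" is not enough for to_pandas();
# install with: pip install "snowflake-snowpark-python[pandas]"
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
    "database": "GLOBAL_FACTORY_DEV",
    "schema": "FACTORY_SCHEMA",
}

session = Session.builder.configs(connection_parameters).create()
pdf = session.table("FACTORY_DATA").to_pandas()
print(pdf.dtypes)
```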
Conclusion
Databricks simplifies the process of swiftly uploading and analyzing CSV files. Notebooks offer an intuitive interface and the ability to leverage various popular Python plotting libraries.
In contrast, Snowflake necessitates the initial creation of tables before data upload, introducing multiple steps just to begin. Worksheets are less user-friendly when it comes to handling multiple code blocks, representing a significant departure from the favored user experience of notebooks. The extent to which plotting libraries can be utilized remains uncertain, often leading to challenges that are hard to resolve.