In October 2017, we published an article introducing Data Science Workbench (DSW), our custom, all-in-one toolbox for data science, complex geospatial analytics, and exploratory machine learning. It centralizes everything required to perform data preparation, ad-hoc analyses, model prototyping, workflow scheduling, dashboarding, and collaboration in a single-pane, web-based graphical user interface.
In this article, we reflect on the evolution of DSW over the last 3 years. We review how DSW’s usage has grown to include and supercharge the workflows of more than just data scientists, dive into how our early design choices have helped us scale, offer an in-depth look at our current approach and future goals for democratizing data science, and share some lessons we learned along the way.
Challenges of Unexpected Growth
Since DSW’s launch in 2017, we’ve seen usage skyrocket: over 4,000 monthly active users now leverage DSW for their data science, complex analytics, and exploratory machine learning needs. Along with this growth, we have also seen more diverse customer needs. Teams are leveraging DSW for increasingly complex applications, such as pricing, safety, fraud detection, and customer support, among other foundational elements of the trip experience.
What surprised us the most was that it wasn’t just data scientists and engineers, but also analysts and operations managers, who were leveraging our tool to solve specific challenges in their individual domains. Figure 1 shows some typical personas of our platform.
While we did not explicitly expect the platform to attract users from different disciplines, we’re seeing firsthand the benefits of leveling the playing field by making sure that everyone in our organization has access to the data and advanced tooling they need in order to continuously improve the customer experience. This made it clear that DSW needs to be more reliable and scalable in order to serve its increasing number of users, as well as handle more complex and critical jobs. The platform should also be easy to use not only for skilled data scientists, but also for customers with different backgrounds and skill sets.
In the next sections, we will show what we did to strengthen the foundations for scalability and reliability, and the innovations we made to improve the experience for everyone our platform touches.
Strengthening our Foundations
Our growth forced us to rethink some of our assumptions about infrastructure.
Here are some key improvements we made to DSW, and how they strengthen the foundations of our platform:
Leveraging Peloton for Improved Resource Management
When we first designed DSW, we made certain assumptions about usage growth and infrastructure requirements. We chose a model where each user session runs an entire instance of our backend service, in a bid to ensure complete isolation for each user. However, as our user base grew and we began building out new features and fixing existing issues, several practical challenges with this approach emerged. Making incremental changes was hard, as it required deploying changes to every single session, interrupting workflows. Moreover, this approach wasn’t in line with our general recommendation for multi-tenant services, which meant we had to manage our own infrastructure without access to Uber’s top-notch infrastructure management and SRE teams.
We decided to make a hard break from this model and leverage Peloton, our custom resource scheduler for cluster workloads. DSW became a containerized environment, with each user still getting a completely isolated session. However, each session was now a Docker container built from a custom image. This model allowed us to easily update the custom image whenever new changes needed to be pushed; these changes would be available to all new sessions, while existing sessions could continue uninterrupted, with users choosing when to upgrade. By leveraging a centrally managed pool of resources, we were able to improve our overall uptime, and could also rely on centrally provided services to handle infrastructure maintenance.

Finally, we recognized that not all users actually use 100% of the computing resources they request. To further optimize our resource usage, and to account for future growth more easily, we adopted a strategy of oversubscribing the number of sessions that can be hosted on a single compute instance. This allowed us to optimally distribute limited compute resources, while also ensuring we always had enough capacity to accommodate new users.
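To make the oversubscription idea concrete, here is a minimal sketch of admission control with an oversubscription factor. The class names, the 2x factor, and the bookkeeping are illustrative assumptions on our part; in practice this placement logic is delegated to Peloton, not hand-rolled like this.

```python
# Minimal sketch of oversubscription-based session placement.
# The names and the 2x factor are illustrative; in DSW the actual
# scheduling is handled by Peloton.
from dataclasses import dataclass, field


@dataclass
class ComputeInstance:
    physical_cpus: int
    oversubscription_factor: float = 2.0  # assumes most sessions sit idle
    sessions: list = field(default_factory=list)

    @property
    def schedulable_cpus(self) -> float:
        # Admit up to factor * physical capacity worth of requests.
        return self.physical_cpus * self.oversubscription_factor

    def place(self, session_id: str, requested_cpus: int) -> bool:
        allocated = sum(s["cpus"] for s in self.sessions)
        if allocated + requested_cpus > self.schedulable_cpus:
            return False
        self.sessions.append({"id": session_id, "cpus": requested_cpus})
        return True


# A 16-core instance can admit 32 requested cores' worth of sessions:
instance = ComputeInstance(physical_cpus=16)
for user in ["alice", "bob", "carol", "dave"]:
    assert instance.place(user, 8)   # all four fit under the 2x factor
assert not instance.place("eve", 8)  # the fifth request is rejected
```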
Making Spark Easier to Use within DSW
Apache Spark® is a foundational piece of our data infrastructure, powering several critical aspects of our business, and DSW users rely on Spark extensively for their workflows. Early on, we took a bet on giving users easy access to Spark tooling by allowing them to run PySpark locally in their notebooks, as well as submit PySpark jobs to a cluster for applications with larger compute requirements. However, communicating environment changes, ensuring reliability, and standardizing how Spark is used became major challenges. To make this process seamless, we integrated closely with uSCS, Uber’s Spark-compute-as-a-service solution, to provide users with a transparent way of submitting Spark jobs without having to worry about specific configuration settings. Since drivers and executors are efficiently allocated remotely, these jobs are much more stable than before. For R users, we adopted the sparklyr package to enable distributed computing on top of Spark SQL.
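For context, this is roughly what a Spark workflow looks like from inside a DSW notebook: plain PySpark, with no cluster endpoints or resource settings in user code. The dataset path and aggregation below are made up for illustration; the point is that configuration is injected behind the scenes rather than written by the user.

```python
# A typical notebook cell: plain PySpark with no cluster addresses or
# executor settings, since (in our setup) uSCS injects the appropriate
# configuration at submission time. The dataset path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trip_analysis").getOrCreate()

trips = spark.read.parquet("/data/trips")  # illustrative dataset
daily_counts = (
    trips.groupBy("city_id", "trip_date")
         .count()
         .orderBy("trip_date")
)
daily_counts.show(10)
```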
Innovating to Differentiate
Over the last 3 years, besides focusing on ensuring platform reliability, we kept our eyes and ears open so we could identify the little things that we could do to make our users’ workflows just that much more seamless.
We’ve come up with a few innovations, as well as a strategic bet, detailed below:
Enabling Worry-Free Task Scheduling with the ‘Bundle’ Service
When we launched DSW, we built in support for a simple job scheduler that allowed notebooks to be triggered on a predetermined cadence. Over the years, we observed users leveraging this feature to automate critical and impactful business processes that were, unfortunately, tied to the temporary nature of user sessions. DSW sessions are not meant to be long-lived; when sessions are terminated, whether by choice or accidentally, users’ jobs are interrupted, causing downtime for their services. Users then have to manually create a new session and re-assign their previous jobs. Moreover, through our conversations with users, we realized there was demand for using notebooks running in DSW sessions as stages in workflows managed elsewhere. Users wanted to export their DSW jobs to run on some of our other data tools, such as our unified workflow management system (Piper) and Michelangelo ML Explorer (MLE).
In order to bridge this gap, we made two key additions to DSW:
- The ability to host automated, scheduled jobs on compute resources that are completely isolated from DSW user sessions and not subject to our regular operational changes, while containing all of the customizations required to reliably execute a notebook
- The ability to trigger a DSW notebook via APIs, accessible from other internal engineering systems
These two innovations allowed us to onboard increasingly complex and company-critical automations.
Figure 2 below shows a typical workflow: a one-click solution seals a user’s code, data, and environment into an independent bundle, which can then be launched from other systems like Piper and MLE. The DSW UI provides a bundle management tab where users can register a bundle based on a mature job they have built in DSW. When a bundle is registered, its code and environment are automatically pushed to permanent storage. Combined with a few pre-built base Docker images, this lets us reproduce the exact environment later, so the bundle can be launched independently. This solution saves a lot of time and computing resources compared with building individual Docker images.
We also provided open APIs in DSW for registering and launching bundles from any other service or system. Additionally, we collaborated with the Piper team to add a new DSW task type, so that bundles can be used in Piper as a standard stage.
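To give a flavor of what launching a bundle programmatically might look like, here is a hypothetical client call. The host, route, payload fields, and response shape are invented for illustration and do not reflect DSW’s actual API schema.

```python
# Hypothetical client for DSW's bundle-launching API; the host, route,
# payload fields, and response shape are illustrative only.
import requests

DSW_API = "https://dsw.example.internal/api/v1"  # placeholder host


def launch_bundle(bundle_id: str, parameters: dict) -> str:
    """Trigger a registered bundle and return its run ID."""
    resp = requests.post(
        f"{DSW_API}/bundles/{bundle_id}/launch",
        json={"parameters": parameters},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]


# A Piper task, for example, could call this as one stage of a workflow:
run_id = launch_bundle("weekly-safety-report", {"week": "2020-W32"})
print(f"Launched bundle run {run_id}")
```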
Intelligently Managing Package Dependencies with ‘Snapshots’
DSW users are allocated individual sessions, which come pre-installed with the most commonly used Python and R packages, and can be customized by installing any additional packages they may need. On average, user sessions contain anywhere between 100 and 150 packages. However, when sessions are terminated, whether intentionally or accidentally, users lose their customizations and need to manually reinstall every package. This can be incredibly time-consuming (almost 30 minutes to install 100 packages), and makes sharing work or replicating existing analyses very difficult. Existing solutions for capturing package dependencies came with a number of limitations, and were not practical for our diverse user base. With Snapshots, we built a simple, one-click solution that allows users to capture a complete dependency graph of all installed packages. This snapshot can then be used to create new sessions with the same packages that were previously installed, and can further be shared with other users who want to replicate a particular analysis, or with teams that want to work from a common set of installed packages.
Figure 3 shows how a user can utilize the Snapshot service. DSW provides a UI tool that captures the current environment of a session, generates a snapshot from it, and saves it to a persistent file system. The snapshot is then available whenever the user wants to create a new session, restore a crashed session, or share an environment with other users. We also built Snapshot-as-a-caching-service, which provides a generic interface for any other service or application to retrieve archived packages. For sessions created from the same snapshot, we cache the environment the first time it is built; all later sessions pull the environment directly from the cache. This design provides a lightweight strategy that consumes far less computing and storage resources than building Docker images.
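As a rough sketch of the caching idea: assuming a snapshot is stored as a fully resolved package manifest, its content hash can serve as the cache key. All paths and helper names below are hypothetical, not DSW’s implementation.

```python
# Sketch of snapshot-as-a-caching-service: environments are keyed by a
# hash of the snapshot's resolved package manifest, so every session
# created from the same snapshot reuses the environment built by the
# first one. Paths and the build step are hypothetical.
import hashlib
from pathlib import Path

CACHE_ROOT = Path("/mnt/snapshot-cache")  # placeholder persistent store


def snapshot_key(manifest: Path) -> str:
    """Content hash of the snapshot's resolved package manifest."""
    return hashlib.sha256(manifest.read_bytes()).hexdigest()


def build_environment(manifest: Path, target: Path) -> None:
    """Stand-in for the expensive step that installs every package
    listed in the manifest into the target directory (elided here)."""
    target.mkdir(parents=True, exist_ok=True)


def get_or_build_env(manifest: Path) -> Path:
    cached = CACHE_ROOT / snapshot_key(manifest)
    if cached.exists():
        return cached                     # later sessions hit the cache
    staging = cached.with_suffix(".building")
    build_environment(manifest, staging)  # first session pays the cost
    staging.rename(cached)                # ...and publishes to the cache
    return cached
```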
We originally managed Python environments with pip and requirements.txt. However, a requirements.txt file doesn’t exhaustively lock the versions of all of a package’s transitive dependencies, so installing the same set of packages in a different order may result in different environments. To resolve this issue, we leveraged Poetry, an open-source Python dependency management tool. Poetry resolves dependency conflicts and locks the versions of all dependencies, allowing users to preserve a stable, cloneable Python environment.
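To see why locking matters, the sketch below walks the installed dependency graph with Python’s standard importlib.metadata and emits a fully pinned list, approximating the information a Poetry lock file captures. It is an illustration of the concept, not how Poetry’s resolver actually works.

```python
# Why a plain requirements.txt falls short: it pins only the packages you
# list, while the transitive closure beneath them can drift. A lock file
# pins everything. This sketch pins the full closure of what is currently
# installed, approximating what a lock file records.
import re
from importlib.metadata import PackageNotFoundError, requires, version


def pinned_closure(roots):
    seen = {}
    stack = list(roots)
    while stack:
        name = stack.pop()
        key = name.lower()
        if key in seen:
            continue
        try:
            seen[key] = version(name)
        except PackageNotFoundError:
            continue  # optional extras or markers not installed here
        for req in requires(name) or []:
            # Keep the bare distribution name, dropping version specifiers
            # and environment markers, e.g. "numpy (>=1.15)" -> "numpy".
            dep = re.split(r"[;\s\[<>=!~(]", req, maxsplit=1)[0]
            if dep:
                stack.append(dep)
    return sorted(f"{pkg}=={ver}" for pkg, ver in seen.items())


for line in pinned_closure(["pandas"]):
    print(line)  # e.g. numpy==1.19.1, pandas==1.1.0, pytz==2020.1, ...
```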
Building a Community that Shares and Learns Together via Knowledge Base
When our user base was small, it was easy to share and discover exploratory insights and techniques. As the organization has grown, and our platform has been adopted by over 4000 weekly active users, transmitting knowledge across teams is no longer easy. This slows down analysis and the speed of decision making. To make this process simpler, we built DSW Knowledge Base on top of Jupyter nbviewer, as a virtual hub where users can share their knowledge and use cases with each other in a community-driven learning environment. Today we have over 3000 pieces of content published as Jupyter notebooks, which have been viewed over 10,000 times. Users can search by keywords in notebook title, description, and code cells; filter on tags, teams, or authors; upvote content they like; and also directly clone the content they find most useful, so they can use it as a starting point for their own analyses.
Figure 4 shows the high-level architecture of Knowledge Base. We use Qumulo NFS to store the actual notebooks for cloning and display, and a MySQL database to store notebook metadata for management operations. Once a user publishes a notebook, we decouple the original from the published version by saving an extra copy, which our viewer then renders. We allow users to update or remove published notebook content and metadata, so they can fix errors and change content.
Figure 4: Architecture diagram for Knowledge Base
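A minimal sketch of that publish flow might look like the following; the storage path, table schema, and function names are illustrative assumptions rather than DSW’s actual implementation.

```python
# Sketch of the publish flow: freeze a copy of the notebook so later
# edits to the user's working copy don't change what readers see, then
# record metadata for filtering and display. The storage path, schema,
# and db handle are illustrative.
import shutil
import uuid
from pathlib import Path

PUBLISHED_ROOT = Path("/mnt/knowledge-base")  # placeholder NFS mount


def publish_notebook(src: Path, title: str, author: str, db) -> str:
    """Copy the notebook to permanent storage and register its metadata."""
    notebook_id = uuid.uuid4().hex
    dst = PUBLISHED_ROOT / f"{notebook_id}.ipynb"
    shutil.copy2(src, dst)  # the decoupled copy that the viewer renders
    db.execute(
        "INSERT INTO notebooks (id, title, author, path) "
        "VALUES (%s, %s, %s, %s)",
        (notebook_id, title, author, str(dst)),
    )
    return notebook_id
```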
For the search feature in Knowledge Base, we integrated with Sia, Uber’s next-generation search platform, enabling users to search for keywords in notebook titles, descriptions, and content. We extract useful code cells from published notebooks and save them to the MySQL database; the resulting database events are then sent to Sia’s live ingestion pipeline for processing. When a user searches in Knowledge Base, our backend calls Sia’s search API with parameters such as the data version, keywords, and analyzer name, and receives the matching notebook IDs in return. Our front-end service then displays these notebooks to the user.
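The backend call might take roughly the following shape; the endpoint and parameter names below are invented for illustration, since Sia’s actual API is internal.

```python
# Hypothetical shape of the backend's call into the search platform; the
# endpoint and parameter names are invented, as Sia's API is internal.
from typing import List

import requests


def search_notebooks(keywords: str, data_version: str = "latest",
                     analyzer: str = "standard") -> List[str]:
    """Return the IDs of published notebooks matching the keywords."""
    resp = requests.post(
        "https://sia.example.internal/search",  # placeholder endpoint
        json={
            "query": keywords,
            "data_version": data_version,
            "analyzer": analyzer,
        },
        timeout=5,
    )
    resp.raise_for_status()
    # The front end hydrates these IDs into full notebook listings.
    return resp.json()["notebook_ids"]
```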
To increase the discoverability of notebooks, we provided options in the interface to highlight notebooks based on their view count, number of likes, and publish time. Users can also filter notebooks by tags, teams, or authors, and upvote notebooks they like; liked notebooks are displayed together in the interface for easy lookup.
These are just some of the improvements and innovations that we’ve added to the Data Science Workbench in recent years. The process of designing, building, and launching each of these features has taught us a lot, and we believe these lessons are equally useful for anyone building or managing similar platforms at scale. In the spirit of sharing, let us take you through these learnings in the next section.
Key Learnings
Over the last 3 years, we’ve seen firsthand the importance of exploratory data science and empowering people across the company to do more with their data. This has allowed us to truly accelerate innovation, as evidenced by some of the impactful use cases discussed in the previous section.
We also learned about the challenges this entails, and the responsibilities it brings. First, being a ‘good citizen’ is incredibly important: as we worked to democratize access to our tool, we remained laser-focused on maintaining user privacy and file security, and on enforcing access control restrictions wherever possible. Next, scaling is hard: with over 3,000 people using our platform every day, a number that has grown rapidly every quarter, we had to continually reassess our technical assumptions and make rapid improvements to keep pace with the growing demand for our product. Finally, reliability is paramount: users trust us to host their analyses and dashboards, and rely on us to continue serving our customers 24x7x365 globally; anything less than 99.99% availability is quite simply unacceptable.
Basic operational learnings aside, our users validated a lot of our initial goals, but also surprised us in some ways. Our findings led us to 3 key takeaways:
Build for the experts, design for the less technical users
Data Science Workbench was initially conceived as a platform to make data scientists more productive. While a lot of our feature set was designed to easily fit into a data scientist’s workflow, we were surprised to find that many of our operations teams were utilizing this advanced tooling to significantly improve their processes as well. They emerged as true force multipliers, but often needed some hand-holding. It was important for us to ensure we built appropriate guardrails into the platform.
At the same time, we noticed data scientists enjoying productivity gains in the ML prototyping process, but struggling with the jarring context switch required when productionizing ML models. This finding inspired our investment in PyML and tighter integration with Michelangelo (our production ML platform).
Don’t stop at building what’s known; empower people to look for the unknown
When building Data Science Workbench, we went through an extensive user research and consultation process to ensure our limited resources could be directed intelligently to design a platform that met most of our users’ needs. Nailing product-market fit is a great achievement for a platform, but it’s important to also stretch development efforts to chase avenues that may not be on customers’ radars just yet. Tracking industry trends, we chose to focus early on providing our users with GPUs to support deep learning. Our bet paid off, and we’re seeing an increasing number of impactful deep learning applications being prototyped on our platform, with our community using these resources in ways we never could have predicted based solely on surveys.
Create communities with both data scientists and non-data scientists
By facilitating the contribution of diverse ideas from across the organization, and giving people the tools to explore and operationalize them, teams were able to harness the real power of our data, without restricting this ability to the small subset of employees familiar with data science principles and techniques. Through Data Science Workbench’s Knowledge Base, we were able to achieve cross-pollination and knowledge sharing among different job functions, all while staying nimble on top of a growing collection of data and insights. We like to think that our work is more than just building a tool: it’s giving our users superpowers.
Having seen the benefits of our platform’s features and capabilities, we’re more excited than ever to work on further innovations that will make it even better.
Acknowledgments
This product and service was made possible thanks to the hard work of several individuals, named alphabetically, who came together to design and build all aspects of Data Science Workbench: Anne Holler and Jin Sun representing engineering management; Hong Wang, Hongdi Li, Peng Du, Sophie Wang, and Taikun Liu representing the engineering team; and Atul Gupte representing the PM team. We also want to thank everyone who has helped build DSW at some point during their time at Uber: Adam Hudson, Ankit Jain, Eric Chen, Hari Subramanian, and Yuquan Zhang. Finally, we’d like to thank Christina Luu for helping revise this blog.
Peng Du
Peng Du is a senior software engineer II with Uber AI. He is the creator of DSW and has been its technical lead through several evolutions. Before Uber AI, he worked on the Data Infrastructure team, focusing on query engines like Hive and Presto, and on improving geospatial queries. Prior to Uber, he worked on data/ML problems at Google. He is also a champion of many coding competitions, such as ACM and Google Code Jam.
Taikun Liu
Taikun Liu is a senior software engineer with Uber AI. She mainly worked on the Bundle project, the Peloton migration, and the Spark and security areas of DSW. She currently works on Machine Learning Artifact Metadata Tracking as part of the Canvas project. Before Uber AI, she worked on the Data Infrastructure team, focusing on multi-datacenter replication and Hadoop resource monitoring and management.
Sophie Wang
Sophie Wang is a software engineer II with Uber AI. She mainly worked on the Snapshot service, Spark, and Knowledge Base areas of DSW. She currently works on the dedicated DSW Canvas session project.
Hong Wang
Hong Wang is a software engineer II with Uber AI. He mainly worked on the unified UI and Knowledge Base in DSW. He currently works on the Michelangelo Studio project, which consolidates DSW, Michelangelo, and MLE. He is the visualization expert on the Uber AI team.
Hongdi Li
Hongdi Li was a senior software engineer with Uber AI. He mainly worked on the Network File System and other infrastructure areas of DSW. Prior to Uber, he worked on infrastructure problems at Salesforce.
Jin Sun
Jin Sun recently joined the DSW team as a Tech Lead Manager. His main focus is unifying product experience across DSW and Michelangelo.
Posted by Peng Du, Taikun Liu, Sophie Wang, Hong Wang, Hongdi Li, Jin Sun