"Code quality is by far the biggest problem with Jupyter notebooks today." That's the statement we at Sonar most frequently encounter when we engage with Data Scientists to understand what keeps them awake at night. But why, and is there a solution? Do Jupyter notebooks even need linting? Let's dig in.
So what's the issue?
The rapid growth of big data has driven the demand for quality data analysis, and with it, the rise of Data Science as a critical business function. Data Scientists, often with limited coding experience, turn to prototyping tools such as Jupyter notebooks to model, test, and express ideas quickly, and to deliver valuable insights and intelligence to business leaders. As the sheer volume of data continues to grow, along with the value that businesses place upon the insights Data Scientists can extract, the pressure for these data insights to arrive quickly increases.
However, the tension between speed and code quality is a persistent issue in the world of Data Science. The need to move fast and iterate quickly may come at the cost of code quality. And in the case of Jupyter notebooks, the quality of the code is often cited as the biggest challenge. The lack of linting tools for Jupyter notebooks that can safeguard code quality without slowing down the Data Scientist or interrupting their flow only heightens that tension.
Do developers already have the answer?
As the saying from Ecclesiastes goes, there is nothing new under the sun. If we look to the world of the traditional developer, we find methodologies and tools that have been successfully implemented in traditional coding environments and that have addressed this very issue. They have resolved the tension between the quality of code and the need for speed. Yes, we are saying that Data Scientists might learn from developers 🙂
The key has been to empower developers to own their code quality, and to catch and correct errors during the creation process, not as an afterthought. This is achieved through a coding companion that sits unobtrusively in the wings during the code creation process until called upon. When needed, it alerts the developer with pertinent, contextual information, allowing them to decide quickly on the course of action required without breaking the creative flow. This is proven methodology and tooling that traditional developers have embraced to ensure clean outputs in the moment, rather than having to circle back later. That is, to deliver clean code the first time: code that is both fit for development and fit for production. So why not apply this same tooling and methodology to Jupyter notebooks and Data Scientists?
But Data Scientists are not traditional developers. The needs of a Data Scientist are different. We know that Data Scientists use notebooks as a tool to model, test, and express ideas. Coding is just a necessary requirement to achieve this; it is not the day job. And it's ok to not be a coder/developer. Prototyping should ideally be a fast, creative process, unhindered by the coding that is required. But at the same time, the results need to be understandable, and the ideas easily absorbed and not misinterpreted. The need for Clean Code remains.
What is this magic?
But what is this tooling and how would it practically apply to Jupyter notebooks? Traditional developers typically use a linter in their IDE to help them find issues, but the opportunity for Data Scientists using Jupyter notebooks goes further, and the needs are different. More than just a linter, Sonar envisions a solution that actually works alongside the Data Scientist to offer a coding companion, empowering creative flow with clean outputs.
Recognizing that the notebook user may not be an experienced coder, the solution should flag any issues that would negatively impact the quality of the code and, ultimately, the integrity and portability of the notebook. It would also offer educational guidance, to be accessed if needed, so that any issue can be corrected easily without interrupting the flow. Less of a linter, in the traditional coding sense, and more of a companion that offers to explain why an issue has been flagged, what is at stake, and how to fix it.
We at Sonar are dipping our toe in the water with a solution that does just that: a solution that is low effort, causes minimal disruption to flow, and empowers Data Scientists to create Jupyter notebooks with quality code. Tackling the biggest perceived issue with notebooks is something we feel ready to do.
Leveraging our experience empowering traditional developers to create Clean Code, we have started humbly with a solution for Jupyter notebooks that flags issues the moment they are created and offers optional, easy ways to remediate them. After all, Clean Code is also for Data Scientists!
But why bother? What would be the actual benefits of Clean Code for a Data Scientist?
We see 3 key benefits:
Be understood
When you open a Jupyter notebook and start the creative process, you want the end result to be clear and understandable. But the reality is that a notebook is a space for modeling, trying different things, and moving fast.
After a sustained period of data sciencing it is normal to expect pages of code that may have been copy-pasted throughout the notebook. Abbreviations may have been used, there might be variables that don't exist anymore, and the execution count may be in the thousands. In other words, the end result can be messy. The resulting code is perhaps unclean and this adds to the complexity, making the notebook and the results themselves difficult to understand, let alone communicate.
The bottom line is that you, through your untidy notebook, are less likely to be understood.
Now imagine a notebook that has Clean Code, from the get-go. It has been created clean, with the aid of the companion sitting quietly in the wings until it spots an issue. No need to go back over the code and remediate. No lost time, no break in the flow. Just a notebook that is easy to understand. Helping you be understood.
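To make the kind of mess described above a little more concrete, here is a minimal Python sketch of how two of those symptoms (out-of-order execution counts and names that are used but never defined) could be spotted in a notebook file. It assumes the standard `.ipynb` JSON layout, in which each code cell carries `cell_type`, `execution_count`, and `source` fields. The function name `notebook_smells` and the sample notebook are hypothetical and purely illustrative; this is not how SonarQube for IDE works internally, and a real linter performs far more careful analysis.

```python
import ast
import builtins
import json


def notebook_smells(nb: dict) -> dict:
    """Report a few coarse smells in a parsed .ipynb document (hypothetical helper)."""
    code_cells = [c for c in nb.get("cells", []) if c.get("cell_type") == "code"]

    # Execution counts of the cells that were actually run, in document order.
    counts = [c["execution_count"] for c in code_cells
              if c.get("execution_count") is not None]

    assigned, used = set(), set()
    for cell in code_cells:
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a string or a list of lines
            source = "".join(source)
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # e.g. IPython magics; simply skipped in this sketch
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Store):
                    assigned.add(node.id)
                else:
                    used.add(node.id)
            elif isinstance(node, ast.alias):  # import x / from x import y as z
                assigned.add((node.asname or node.name).split(".")[0])
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                assigned.add(node.name)
            elif isinstance(node, ast.arg):  # function parameters
                assigned.add(node.arg)

    return {
        "max_execution_count": max(counts, default=0),
        # True when cells were run in a different order than they appear.
        "out_of_order": counts != sorted(counts),
        # Names read somewhere in the notebook but never bound anywhere in it.
        "undefined_names": sorted(used - assigned - set(dir(builtins))),
    }


# A tiny made-up notebook: the second code cell was run before the first,
# and it reads a variable that no cell defines any more.
example = {
    "cells": [
        {"cell_type": "code", "execution_count": 3,
         "source": "import pandas as pd\ndf = pd.DataFrame()"},
        {"cell_type": "markdown", "source": "Notes"},
        {"cell_type": "code", "execution_count": 2,
         "source": ["total = df.shape[0] + old_total\n", "print(total)"]},
    ]
}
report = notebook_smells(example)
print(json.dumps(report, indent=2))
```

To try it on a real notebook, the file can be read with `json.load` and passed to the function in place of `example`. The check is deliberately rough: it ignores scoping and the order of definitions, so it can miss issues or raise false alarms that a dedicated tool would handle properly.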
It's easy to jump back in and develop further or share with confidence
Because Jupyter notebooks are spaces where data modeling takes place, theories are tested, and modifications are commonplace, it makes sense that notebooks can be shelved for a time and then revisited with new data, new thoughts, and new ideas. Notebooks may be shared with others, or code snippets posted to developer groups for comment, iteration, and collaboration. All of the above is made easier when the code style and structure follow standard developer norms. Not only does a Clean Code approach make it easier for anyone to pick up a notebook and carry on where the previous creator left off, but it also helps ensure that code shared outside of notebooks with the wider developer community is accepted and portable.
A notebook with Clean Code facilitates onward sharing and ensures easy collaboration and future development.
It helps you grow as a Data Scientist and a coder
It is ok to not be a coder. But if you could learn as you code, that presents an opportunity for personal growth. A solution that offers to explain "the why" behind an error and suggests how to remediate it delivers that chance. Sure, there are times when speed is of the essence, and the quick fix to remediate the issue is just the ticket. But when the moment is right, the companion is ready to put on its teacher's hat and offer contextual learnings that deliver growth.
Plus, having confidence in your notebook and its quality, in terms of both being free from coding issues and being readable and conforming to accepted norms, helps you, the author, creator, and owner. It can only protect your reputation, and perhaps even enhance it, based upon the quality of your work.
So how do I get started?
We believe Clean Code should also be for Data Scientists and users of Jupyter notebooks. We also believe that everyone has the right to access the tools that deliver Clean Code. This is why at Sonar we are taking our first baby steps to offer a free and open-source solution that will act as your Clean Code companion. Downloadable from the VS Code marketplace, SonarQube for IDE lets Data Scientists both write and run Jupyter notebooks within the IDE, while acting as a code companion and teacher that helps ensure Clean Code. You can find the official announcement here.
By enabling all Jupyter notebook users to easily write Clean Code without slowing down development velocity, we aim to reduce the tension between the requirement for speed and the quality of the code. But we are not finished. We'd love to hear from you about how we might improve and develop SonarQube for IDE further.