This article provides a high-level overview of Project Jupyter and the widely popular Jupyter notebook technology. The overarching message I’d like to convey is why you should be using Jupyter for your data science projects. I’ve been using it for all my Python machine learning work and I’m quite impressed and satisfied. It’s a great environment in which to develop code and communicate results.
Project Jupyter is a nonprofit organization created to “develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.” Spun off from IPython in 2014 by co-founder Fernando Pérez, Project Jupyter supports execution environments in several dozen languages.
The name “Jupyter” was chosen to bring to mind the ideas and traditions of science and the scientific method. It also nods to the core programming languages Jupyter supports: Julia (Ju), Python (Py), and R. While “Jupyter” is not a strict acronym for these languages, it firmly aligns the project with them.
Jupyter Notebooks
The Jupyter Notebook is an open-source web application that allows data scientists to create and share documents that integrate live code, equations, computational output, visualizations, other multimedia resources, and explanatory text in a single document. You can use Jupyter Notebooks for all sorts of data science tasks, including data cleaning and transformation, numerical simulation, exploratory data analysis, data visualization, statistical modeling, machine learning, deep learning, and much more.
A Jupyter Notebook provides you with an easy-to-use, interactive data science environment that works not only as an integrated development environment (IDE) but also as a presentation and educational tool. Jupyter is a way of working with Python inside a virtual “notebook” and is growing in popularity with data scientists in large part due to its flexibility. It gives you a way to combine code, images, plots, comments, etc., in alignment with the steps of the data science process. Further, it is a form of interactive computing: an environment in which users execute code, see what happens, modify, and repeat in a kind of iterative conversation between the data scientist and the data. Data scientists can also use notebooks to create tutorials or interactive manuals for their software. Here is a short instructional video to help get you started with Jupyter.
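As a minimal illustration of that workflow, a single notebook cell can compute a result and render a plot directly beneath the code. This is just a sketch, and it assumes NumPy and Matplotlib are installed:

```python
# A typical notebook cell: compute something, then display the result inline.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)      # 200 evenly spaced points over one period
plt.plot(x, np.sin(x), label="sin(x)")
plt.title("Output rendered directly below the cell")
plt.legend()
plt.show()
```

Run inside a notebook, the figure appears immediately below the cell, and Markdown cells around it can carry the accompanying commentary.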
A Jupyter notebook has two components. First, data scientists enter programming code or text in rectangular “cells” on a front-end web page. The browser then passes the code to a back-end “kernel,” which runs the code and returns the results. Many Jupyter kernels have been created, supporting dozens of programming languages. The kernels need not reside on the data scientist’s computer; notebooks can also run in the cloud, such as on Google’s Colaboratory project. You can even run Jupyter without network access, right on your own computer, and perform your work locally.
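If you want to see which kernels are registered on your machine, the jupyter_client package (installed alongside Jupyter) can list them programmatically. A small sketch; the exact names and paths will depend on your installation:

```python
# List the kernels this Jupyter installation knows about.
from jupyter_client.kernelspec import KernelSpecManager

specs = KernelSpecManager().find_kernel_specs()  # maps kernel name -> spec directory
for name, path in specs.items():
    print(f"{name}: {path}")
```

On a plain Python setup this typically prints just a python3 entry; installing additional kernels (an R or Julia kernel, say) adds more.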
Other Jupyter Tools
JupyterLab (originally launched in beta in January 2018) is commonly viewed as the next-generation user interface for Project Jupyter, offering all the familiar building blocks of the classic Jupyter Notebook (notebook, terminal, text editor, file browser, rich outputs, etc.) in a flexible and more powerful user interface. The basic idea of JupyterLab is to bring all the building blocks of the classic notebook, plus some new ones, under one roof. JupyterLab extends the familiar notebook metaphor with drag-and-drop functionality, as well as file browsers, data viewers, text editors, and a command console. Whereas the standard Jupyter notebook assigns each notebook its own kernel, JupyterLab creates a computing environment that allows these components to be shared. Thus, a data scientist could view a notebook in one window, edit a required data file in another, and log all executed commands in a third, all within a single web-browser interface.
[Figure: Example of the JupyterLab interface]
Two additional tools have enriched Jupyter’s usability. One is JupyterHub, a service that allows institutions to provide Jupyter notebooks to large pools of users. The other is Binder, an open-source service that allows data scientists to open Jupyter notebooks hosted on GitHub in a web browser without having to install the software or any programming libraries.
Platforms Using Jupyter
The popularity of Jupyter goes beyond its use as a stand-alone tool; it’s also integrated with a number of platforms familiar to data scientists.
Anaconda is a prepackaged distribution of Python that contains a large number of Python modules and packages, including Jupyter. In fact, Anaconda is the recommended distribution for installing Jupyter. This is how I use Jupyter, because I enjoy the flexibility afforded by Anaconda Navigator and the ability to define a number of different “environments” with different frameworks (such as TensorFlow), different Python versions, and so on. A quick way to confirm which environment a notebook’s kernel is actually using is sketched below.
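This is a small illustrative check, run from inside a notebook cell; the TensorFlow import is only an example and assumes that framework is installed in the selected environment:

```python
# Confirm which Python environment the current kernel is running in.
import sys
print("Python executable:", sys.executable)      # path reveals the active environment
print("Python version:", sys.version.split()[0])

# Optional: check a framework that was installed into this environment.
try:
    import tensorflow as tf
    print("TensorFlow version:", tf.__version__)
except ImportError:
    print("TensorFlow is not installed in this environment")
```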
Kaggle Kernels are essentially Jupyter notebooks running in the browser, which means you can skip the hassle of setting up a local environment and work from anywhere in the world you have an internet connection.
Colab notebooks are Jupyter notebooks hosted by Google Colab. Colab enables users to collaborate, run code on Google’s cloud resources such as GPUs and TPUs, and save documents to Google Drive.
An Amazon SageMaker notebook instance is a fully managed machine learning (ML) compute instance, based on Amazon EC2, that runs the Jupyter Notebook application. You use the notebook instance to create and manage Jupyter notebooks for preparing and processing data and for training and deploying machine learning models.
Finally, there are many examples of Jupyter notebooks available on GitHub (reviewing them is a good way to learn what’s possible). There are more than 3 million public notebooks today, up from ~200,000 in 2015.
Conclusion
For data scientists, Jupyter has emerged in recent years as a de facto standard. The migration to Jupyter is arguably the fastest adoption of a platform in recent memory. Many ML/DL research papers appearing on the arXiv.org pre-print server reference Jupyter notebooks that integrate the research with deep learning frameworks like TensorFlow and PyTorch. The beauty of Jupyter is that it creates a computational narrative: a document that allows researchers to supplement their code and data with analysis, hypotheses, and conjecture. For data scientists, that format can drive creative exploration. If you haven’t already looked at Jupyter technology, it is high time to do so!