I recently came across some interesting functionality in Python that I have started trying to use more in my day-to-day programming. In this post, I want to introduce Python type hints and discuss why I think they could be a really powerful addition to your programming workflow.
If you are an avid Python user, you might find type hints initially quite strange and maybe even unpythonic but hopefully, by the end of this post, I will have convinced you that this is not the case and actually they could really improve your codebase and make your life easier in the long run.
To get an idea of why they might be useful, take the code snippet below, which defines a function that adds two numbers. When we pass in a string and try to add it to an integer, we get a lovely TypeError. This is a super simple example, but if we were working with a larger code base we might not notice this error until we actually ran the code. This can often be quite problematic.
def add_two_numbers(a, b):
    return a + b

>>> add_two_numbers(1, '3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Static vs Dynamic Typing
Above we provided a trivial example of a type error. Let’s briefly discuss why this problem arises in Python. To do this, we need to discuss static and dynamic typing. If you have ever used programming languages like C or Java, you will know that you need to declare data types for the variables that you create. In Java, we can create a string variable using the code below. These variable types are not allowed to change, i.e. if we tried to assign an int to a String variable we would get an error.
String string_variable;
string_variable = "This is a string";
The second important aspect of static typing is when type errors are caught. Because a programming language like Java is compiled, the compiler will check for code correctness before any of the code is run. If there are type errors in your code, then this is when they will be caught. This provides some advantages over interpreted languages such as Python. In particular, catching errors early can save a lot of time spent debugging your code.
Python, however, is dynamically typed. This means we do not have to explicitly declare our variable types. It also means that Python only checks for type errors as the code runs. If, for example, we had a type error embedded in if-else logic, it would not be caught unless the condition leading to it was met. Taking our string variable example above, if we tried to reassign an int to it, the code would run perfectly fine and our string_variable would now be of type int.
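A minimal sketch of both points. The describe function is a hypothetical helper, not part of the original example: reassigning a name to a different type is accepted silently, and a type error hidden in a branch only surfaces when that branch actually runs.

```python
# Reassigning a name to a value of another type is perfectly legal in Python.
string_variable = "This is a string"
string_variable = 10
print(type(string_variable).__name__)  # int


def describe(count, verbose=False):
    """Hypothetical helper with a type error hidden in one branch."""
    if verbose:
        # Bug: concatenating a str and an int raises TypeError,
        # but only when this branch is executed.
        return "count is " + count
    return "ok"


print(describe(3))  # the buggy branch is never reached, so no error

try:
    describe(3, verbose=True)
except TypeError as exc:
    print("TypeError:", exc)
```

The first call succeeds because the faulty line is never executed, which is exactly why such bugs can sit unnoticed in a larger code base.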
Duck Typing
Another concept that is closely related to dynamic typing and Python is duck typing. What in god’s name is duck typing, I hear you ask? Well, I agree, it is a strange name, but it comes from the phrase: If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. We don’t necessarily care about the type of an object; we care about the object’s behaviour. If it behaves like a duck, then for all intents and purposes we can consider it a duck.
Let’s take an example to understand what duck typing is all about. Since I am a data scientist, I would like to look at a realistic example using sklearn. If, for example, we wanted to code some custom transformations in a modelling pipeline, we could create a custom transformer in sklearn. As Geron puts it:
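As a toy illustration of the phrase (the class and function names here are made up for this sketch), the function below never checks what type it receives. It only relies on the object having a quack method.

```python
class Duck:
    def quack(self):
        return "Quack!"


class RobotDuck:
    """Not related to Duck by inheritance, but it can quack."""

    def quack(self):
        return "Beep-quack!"


def make_it_quack(thing):
    # No isinstance check: anything with a quack() method is accepted.
    return thing.quack()


for candidate in (Duck(), RobotDuck()):
    print(make_it_quack(candidate))

# An object without quack() fails only when the method is looked up.
try:
    make_it_quack(42)
except AttributeError as exc:
    print("AttributeError:", exc)
```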
“Since Scikit-Learn relies on duck typing (not inheritance), all you need is to create a class and implement three methods: fit() (returning self), transform(), and fit_transform()” (Geron 2019).
Below we write a custom transformer to impute numerical data. As long as we implement the fit and transform methods, it will be completely compatible with sklearn and can be used in pipelines alongside other sklearn transformers. This is made possible by duck typing: sklearn only cares that our class behaves like a transformer, i.e. if it has fit, transform and fit_transform methods then it will work. By the way, there is no fit_transform method below; we get this method when we inherit from the TransformerMixin class. Here is a pretty good explanation of the behaviour.
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression


class CustomImputer(BaseEstimator, TransformerMixin):
    """Impute missing data for numerical features."""

    def __init__(self, variables=None):
        if not isinstance(variables, list):
            self.variables = [variables]
        else:
            self.variables = variables

    def fit(self, X, y=None):
        # learn the mean of each feature from the training data
        self.imputer_dict_ = {}
        for feature in self.variables:
            self.imputer_dict_[feature] = X[feature].mean()
        return self

    def transform(self, X):
        X = X.copy()
        for feature in self.variables:
            # assign back instead of filling in place on a column selection
            X[feature] = X[feature].fillna(self.imputer_dict_[feature])
        return X


# generate some data
X, y = make_blobs(n_samples=10, centers=3, n_features=4,
                  random_state=0)
df = pd.DataFrame(X, columns=['X1', 'X2', 'X3', 'X4'])
df.loc[2:7, 'X1'] = np.nan  # add missing values in rows 2 to 7
missing_columns = df.columns[df.isnull().any()].values[0]

preprocessor = Pipeline(steps=[
    ('imputer', CustomImputer(missing_columns)),
    ('scaler', StandardScaler())])

lr = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', LogisticRegression())])
lr.fit(df, y)
What are Python Type hints and why should I care?
OK, so we have explained the differences between dynamic typing, static typing and duck typing, but why does this matter and what has this got to do with type hints? Well, even though Python is a dynamically typed language, we can use type hints to achieve some of the benefits of a statically typed language.
Type hints (also called type annotations) were first introduced in PEP 484 and allow us to mimic the behaviour of statically typed languages. I say mimic because the Python interpreter does not enforce type hints at runtime. Instead, type hints rely on the user to separately run checks using something like mypy, but more on this later.
So how do we actually make use of type hints? The documentation details all of the different ways you can use type hints, but I will talk about the ones you will tend to use most often. The syntax for type hints is pretty simple and intuitive, and if you have ever coded in a statically typed language then this won’t be anything new to you. The main module you will be interacting with if you want to start using type hints is the typing module. It contains many of the types that we will be using, such as List, Dict and Tuple among others.
Let’s take a simple example below where we declare some variables and illustrate some of the types from the library. If we look at the slightly contrived example in process_data, we can see that we have two inputs, a list of strings and a list of integers. We also know that our output should be a list of tuples, each containing a string and an integer. In my opinion, this makes it instantly easier to understand the code and it wasn’t too much extra effort to implement.
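To see what "mimic" means in practice, here is a small sketch with a hypothetical double function. The hints are stored on the function but never checked when the code runs, so a call that violates them still goes through.

```python
def double(x: int) -> int:
    return x * 2


print(double(4))     # 8
print(double("ab"))  # abab: the str argument violates the hint, yet no error

# The hints are kept as metadata on the function, nothing more.
print(double.__annotations__)
```

Only a separate tool such as a static type checker would complain about the second call.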
from typing import List, Dict, Tuple

x: int = 10
y: float = 0.8
string: str = 'I am a string'
hash_map: Dict[str, int] = {}
List1: List[int] = []
my_tuple: Tuple[str, int] = ('2020', 300)  # must hold exactly one str and one int


def process_data(years: List[str], values: List[int]) -> List[Tuple[str, int]]:
    new_list = [(year, value) for year, value in zip(years, values)]
    return new_list


years = ['2018', '2019', '2020']
value = [100, 200, 300]
process_data(years, value)
Note above that we can use nested type hints, which can be quite useful but can also become unwieldy if we require two or three layers of nesting. In this case, we can assign a type hint to a variable, as below. This is known as a type alias. Using aliases appropriately can greatly improve the readability of your code while maintaining the advantages of type hints.
years = Tuple[str, int]
year_list = List[years]
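A self-contained sketch of the same idea applied to process_data. The alias names YearValue and YearValueList are my own choice for illustration.

```python
from typing import List, Tuple

# Type aliases: plain assignments that give a nested hint a readable name.
YearValue = Tuple[str, int]
YearValueList = List[YearValue]


def process_data(years: List[str], values: List[int]) -> YearValueList:
    # Pair each year with its value.
    return list(zip(years, values))


print(process_data(['2018', '2019'], [100, 200]))
# [('2018', 100), ('2019', 200)]
```

The signature now reads as "returns a list of year-value pairs" without the reader having to unpick the nested brackets.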
Revisiting our Custom Transformer
Below we update the code we wrote above to include type hints. Most of the type hints should be clear. We can also use types from pandas, which is obviously very useful as the majority of data analysis in Python uses pandas in some way. You might be wondering why we have included “CustomImputer” as a string in the return type of the fit method. There is a good reason for this and it is discussed here. It is known as a forward reference, since we are referring to a type that has not been created yet. Including it as a string literal addresses this issue. Other than that, the code should be pretty clear and hopefully clearer than the original snippet.
Hopefully, you would agree that explicitly stating the data types makes the code immediately clearer. The advantages of this are more obvious when we have large, complex code bases, but believe me, if you need to refactor someone else’s code or revisit your own code in the future, you will thank yourself for taking the time to add type hints.
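The forward reference can be illustrated with a simplified, pandas-free sketch. MeanImputer and the Column and Table aliases are hypothetical stand-ins that I made up so the example runs with the standard library only; this is not the typed CustomImputer itself.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]
Table = Dict[str, Column]


class MeanImputer:
    """Simplified, hypothetical stand-in for the CustomImputer above."""

    def __init__(self, variables: List[str]) -> None:
        self.variables = variables
        self.means_: Dict[str, float] = {}

    # While the class body is still being executed, the name MeanImputer
    # does not exist yet, so the return type is written as a string literal.
    def fit(self, X: Table) -> "MeanImputer":
        for feature in self.variables:
            # Assumes each listed column has at least one non-missing value.
            present = [v for v in X[feature] if v is not None]
            self.means_[feature] = sum(present) / len(present)
        return self

    def transform(self, X: Table) -> Table:
        # Copy the columns so the caller's data is left untouched.
        out = {name: list(col) for name, col in X.items()}
        for feature in self.variables:
            mean = self.means_[feature]
            out[feature] = [mean if v is None else v for v in out[feature]]
        return out


data: Table = {"X1": [1.0, None, 3.0], "X2": [4.0, 5.0, 6.0]}
print(MeanImputer(["X1"]).fit(data).transform(data))
# {'X1': [1.0, 2.0, 3.0], 'X2': [4.0, 5.0, 6.0]}
```

Because fit returns self, the string return type lets a type checker know that chaining fit and transform is valid.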
Mypy: Static Type Checker
Mypy is a static type checker for Python. It allows us to check our code for common type errors before we actually run anything. Used alongside type hints, it brings some of the advantages of statically typed languages such as Java to Python.
Let’s take a very simple example outlined in the mypy documentation. We can define a very simple function which takes in a string and returns “Hello ” followed by our string input. If we pass in a string and run the type checker, we rightly see no errors. If, however, we pass an int or any other data type, we immediately see an error message. Again, this functionality is particularly useful when we are working on larger projects.
def main(name: str) -> str:
    return 'Hello ' + name

if __name__=='__main__':
    main('John')

mypy program.py
>> Success: no issues found in 1 source file

def main(name: str) -> str:
    return 'Hello ' + name

if __name__=='__main__':
    main(10)

mypy program.py
>> program.py:7: error: Argument 1 to "main" has incompatible type "int"; expected "str"
Found 1 error in 1 file (checked 1 source file)
One of the downsides of mypy is that it cannot check calls into external libraries that ship without type hints or stub files. If you tend to work with numpy and pandas quite a bit, then it might be worth exploring data-science-types and pandas-stubs, which let you run type checks for more data science relevant libraries such as pandas, numpy and matplotlib. The data-science-types library is still in development, however, so it doesn’t cover all types available in these libraries.
pip install pandas-stubs
pip install data-science-types
Advantages and Disadvantages of Type Hints
Although I am generally a fan of type hints and there are a lot of advantages to using them, they are not without some disadvantages as well. In terms of the advantages, I think one of the biggest ones is
- It forces you to think about your function’s inputs and outputs: i.e. in general, it makes you think more about your code and how it is designed to be used. For me, anyway, thinking more deeply about the code I am writing has tended to improve my programming skills. I think this reason alone is a good enough one to start using them.
Also, and this is super important, using type hints means your code has
- Clear documentation: Having clear documentation makes your code much easier to read, not only for yourself but for others as well. If you have ever taken over someone else’s code and thought “I have no idea what any of this code is doing”, you will understand how important this is. Clearly documented code doesn’t seem to be that common in data science, but it should be.
- Easier debugging: If it wasn’t clear by now, it should be. Using type hints can make it much easier to debug issues in your code and, in a lot of cases, can avoid some errors completely, particularly if used with a static type checker like mypy.
Another advantage of type hints is that they
- Can be gradually implemented: You are probably thinking that this will add significant overhead in terms of the time it takes to write code compared to what you are used to and you would be right. The good news is that you do not need to refactor your entire codebase in one go. Type hints can be implemented gradually.
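A rough sketch of what gradual adoption can look like, using two hypothetical functions. Only the function being touched gets annotated; with mypy's default settings, a function without annotations is treated as dynamically typed and left alone.

```python
from typing import List


def legacy_total(values):
    # No hints yet: a type checker treats the parameter and result as Any.
    return sum(values)


def mean(values: List[float]) -> float:
    # Newly annotated: calls to this function are now checked.
    return legacy_total(values) / len(values)


print(mean([1.0, 2.0, 3.0]))  # 2.0
```

Annotated and unannotated code can live side by side in the same module, so hints can be added one function at a time.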
OK, so what are the disadvantages? Well, even though you can implement type hints gradually, they still could be a
- Significant time investment: Although the basics of type hints are pretty straightforward, there is a whole lot more to them that I did not cover in this post. As well as the learning curve, using type hints obviously involves writing more code, so it will likely result in increased development time initially. This could, however, be completely offset by reducing the time spent on debugging. I have no empirical evidence of this, but I could well believe it, particularly for larger, more complex projects.
- Does it move away from the simplicity and prettiness of Python? This is more of an open question than an obvious disadvantage, but the hardcore Pythonistas out there may not be willing to desecrate their beautiful code with type hints.
Recommendations and Takeaways
Should you use type hints? Ultimately that is up to you but hopefully, this post has given you some food for thought and will make the decision a little bit easier if you are on the fence. For me, personally, I do like type hints so I try to use them. My advice would be, if you are just doing some quick and dirty data analysis in Jupyter notebooks, the cons of using them probably outweigh the pros so I probably wouldn’t bother. However, if you are working on bigger projects and production code I say give them a try and see if they improve your workflow.