R is primarily used for statistical analysis, while Python provides a more general approach to data science. R and Python are object-oriented towards data science for programming language. Learning both is an ideal solution. Python is a common-purpose language with a readable syntax. — www.calltutors.com
The war between R and Python users has been raging for several years. With most of the old school statisticians being trained on R and most computer science and data science departments in universities instead preferring Python, both have pros and cons. The main cons I have noticed in practice are in the packages that are available for each language.
As of 2019, the R packages for cluster analysis and splines are superior to the Python packages of the same kind. In this article, I will show you, with coded examples, how to take R functions and datasets and import and utilize then within a Python-based Jupyter notebook.
The topics of this article are:
- Importing (base) R functions
- Importing R library functions
- Populating vectors R understands
- Populating dataframes R understands
- Populating formulas R understands
- Running models in R
- Getting results back to Python
- Getting model predictions in R
- Plotting in R
- Reading R’s documentation
There is an accompanying notebook to this article that can be found on my GitHub page.mrdragonbear/ArticlesThis repository contains code-related content relevant to my Medium articles. This repository is the master repository…github.com
Linear/Polynomial Regression
Firstly, we will look at performing basic linear and polynomial regression using imported R functions. We will examine a dataset looking at diabetes with information about C-peptide concentrations and acidity variables. Do not worry about the contents of the model, this is a commonly used example in the field of generalized additive models, which we will look at later in the article.
diab = pd.read_csv("data/diabetes.csv")print(""" # Variables are: # subject: subject ID number # age: age diagnosed with diabetes # acidity: a measure of acidity called base deficit # y: natural log of serum C-peptide concentration # # Original source is Sockett et al. (1987) # mentioned in Hastie and Tibshirani's book # "Generalized Additive Models". """)display(diab.head()) display(diab.dtypes) display(diab.describe())
We can then plot the data:
ax0 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data") #plotting direclty from pandas! ax0.set_xlabel("Age at Diagnosis") ax0.set_ylabel("Log C-Peptide Concentration");
Linear/Polynomial Regression, but make it R
After this section, we’ll know everything we need to in order to work with R models. The rest of the lab is just applying these concepts to run particular models. This section, therefore, is your ‘cheat sheet’ for working in R.
What we need to know:
- Importing (base) R functions
- Importing R Library functions
- Populating vectors R understands
- Populating DataFrames R understands
- Populating Formulas R understands
- Running models in R
- Getting results back to Python
- Getting model predictions in R
- Plotting in R
- Reading R’s documentation
Importing R functions
To import R functions we need the rpy2
package. Depending on your environment, you may also need to specify the path to the R home directory. I have given an example below for how to specify this.
# if you're on JupyterHub you may need to specify the path to R #import os #os.environ['R_HOME'] = "/usr/share/anaconda3/lib/R" import rpy2.robjects as robjects
To specify an R function, simply use robjects.r
followed by the name of the package in square brackets as a string. To prevent confusion, I like to use r_
for functions, libraries, and other objects imported from R.
r_lm = robjects.r["lm"] r_predict = robjects.r["predict"] #r_plot = robjects.r["plot"] # more on plotting later #lm() and predict() are two of the most common functions we'll use
Importing R libraries
We can import individual functions, but we can also import entire libraries too. To import an entire library, you can extract the importr
package from rpy2.robjects.packages
.
from rpy2.robjects.packages import importr #r_cluster = importr('cluster') #r_cluster.pam;
Populating vectors R understands
To specify a float vector that can interface with Python packages, we can use the robjects.FloatVector
function. The argument to this function references the data array that you wish to convert to an R object, in our case, the age
and y
variables from our diabetes dataset.
r_y = robjects.FloatVector(diab['y']) r_age = robjects.FloatVector(diab['age']) # What happens if we pass the wrong type? # How does r_age display? # How does r_age print?
Populating Dataframes R understands
We can specify individual vectors, and we can also specify entire dataframes. This is done by using the robjects.DataFrame
function. The argument to this function is a dictionary specifying the name and the vector (obtained from robjects.FloatVector
) associated with the name.
diab_r = robjects.DataFrame({"y":r_y, "age":r_age}) # How does diab_r display? # How does diab_r print?
Populating formulas R understands
To specify a formula, for example, for regression, we can use the robjects.Formula
function. This follows the R syntax dependent variable ~ independent variables
. In our case, the output y
is modeled as a function of the age
variable.
simple_formula = robjects.Formula("y~age") simple_formula.environment["y"] = r_y #populate the formula's .environment, so it knows what 'y' and 'age' refer to simple_formula.environment["age"] = r_age
Notice in the above formula we had to specify the FloatVector’s associated with each of the variables in our formula. We have to do this as the formula does not automatically relate our variable names to variables that we have previously specified — they have not yet been associated with the robjects.Formula
object.
Running Models in R
To specify a model, in this case a linear regression model using our previously imported r_lm
function, we need to pass our formula variable as an argument (this will not work unless we pass an R formula object).
diab_lm = r_lm(formula=simple_formula) # the formula object is storing all the needed variables
Instead of specifying each of the individual float vectors related to the robjects.Formula
object, we can reference the dataset in the formula itself (as long as this has been made into an R object itself).
simple_formula = robjects.Formula("y~age") # reset the formula diab_lm = r_lm(formula=simple_formula, data=diab_r) #can also use a 'dumb' formula and pass a dataframe
Getting results back to Python
Using R functions and libraries is great, but we can also analyze our results and get them back to Python for further processing. To look at the output:
diab_lm #the result is already 'in' python, but it's a special object
We can also check the names in our output:
print(diab_lm.names) # view all names
To take the first element of our output:
diab_lm[0] #grab the first element
To take the coefficients:
diab_lm.rx2("coefficients") #use rx2 to get elements by name!
To put the coefficients in a Numpy array:
np.array(diab_lm.rx2("coefficients")) #r vectors can be converted to numpy (but rarely needed)
Getting Predictions
To get predictions using our R model, we can create a prediction dataframe and use the r_predict
function, similar to how it is done using Python.
# make a df to predict on (might just be the validation or test dataframe) predict_df = robjects.DataFrame({"age": robjects.FloatVector(np.linspace(0,16,100))})# call R's predict() function, passing the model and the data predictions = r_predict(diab_lm, predict_df)
We can use the rx2
function to extract the ‘age’ values:
x_vals = predict_df.rx2("age")
We can also plot our data using Python:
ax = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data") ax.set_xlabel("Age at Diagnosis") ax.set_ylabel("Log C-Peptide Concentration");ax.plot(x_vals,predictions); #plt still works with r vectors as input!
We can also fit a 3rd-degree polynomial model and plot the model error bars in two ways:
- Route1: Build a design df with a column for each of
age
,age**2
,age**3
fit2_lm = sm.ols(formula="y ~ age + np.power(age, 2) + np.power(age, 3)",data=diab).fit() poly_predictions = fit2_lm.get_prediction(predict_df).summary_frame() poly_predictions.head()