4  Random forest

The random forest (RF) algorithm is probably one of the most famous ML algorithms, and not without reason. Compared to other well-performing algorithms, RF has only a few hyper-parameters, and thanks to bagging and the random sampling of candidate variables at each node split, it adapts its internal complexity well on its own.

In the following, we use the ‘ranger’ package (Wright and Ziegler 2017) (Python: ‘scikit-learn’ (Pedregosa et al. 2011), Julia: ‘MLJ’ (Blaom et al. 2019)).
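The few hyper-parameters mentioned above are, in ranger, essentially the number of trees, the number of candidate variables drawn at each split (mtry), and the minimal node size. A minimal sketch of setting them explicitly (the values below correspond roughly to ranger’s defaults for a classification forest, not a tuned configuration):

library(ranger)
rf_default = ranger(Species~., data = iris,
                    num.trees = 500,        # number of bagged trees
                    mtry = floor(sqrt(4)),  # candidate variables sampled per split
                    min.node.size = 1)      # minimal terminal node size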

4.1 Classification

R:

library(ranger)
X = iris[,1:4]           # predictors: the four flower measurements
Y = iris[,5,drop=FALSE]  # response: Species
data = cbind(Y, X)

rf = ranger(Species~., data = data, probability = TRUE, importance = "impurity")

Show feature importances:

importance(rf)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    9.067816     1.358848    41.845718    43.340283 
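Impurity importance is fast to compute but can be biased towards variables with many possible split points; as an alternative, ranger also offers permutation importance. A minimal sketch on the same data:

rf_perm = ranger(Species~., data = data, probability = TRUE, importance = "permutation")
importance(rf_perm)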

Make predictions (class probabilities):

head(predict(rf, data = data)$predictions, n = 3)
        setosa   versicolor    virginica
[1,] 1.0000000 0.0000000000 0.0000000000
[2,] 0.9995556 0.0002222222 0.0002222222
[3,] 1.0000000 0.0000000000 0.0000000000
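If hard class labels are needed instead of probabilities, one option is to fit the forest without probability = TRUE and tabulate the predictions against the observed species. A minimal sketch (the confusion matrix on the training data is optimistic; the out-of-bag error reported by ranger is a more honest estimate):

rf_class = ranger(Species~., data = data, importance = "impurity")
table(observed = data$Species,
      predicted = predict(rf_class, data = data)$predictions)
rf_class$prediction.error  # out-of-bag misclassification error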
Python:

from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.preprocessing import scale

iris = datasets.load_iris()
X = scale(iris.data)  # predictors (standardized)
Y = iris.target       # response: species as integer labels

model = RandomForestClassifier().fit(X, Y)

Feature importance:

print(model.feature_importances_)
[0.1047826  0.02722545 0.43783582 0.43015613]

Make predictions:

model.predict_proba(X)[0:10,:]
array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])
Julia:

using MLJ;
using RDatasets;
using StatsBase;
using DataFrames;
RF_classifier = @load RandomForestClassifier pkg=DecisionTree;
iris = dataset("datasets", "iris");
X = mapcols(StatsBase.zscore, iris[:, 1:4]);
Y = iris[:, 5];

Model:

model = fit!(machine(RF_classifier(), X, Y))
trained Machine; caches model-specific representations of data
  model: RandomForestClassifier(max_depth = -1, …)
  args: 
    1:  Source @613 ⏎ Table{AbstractVector{Continuous}}
    2:  Source @784 ⏎ AbstractVector{Multiclass{3}}

Feature importance:

feature_importances(model)
4-element Vector{Pair{Symbol, Float64}}:
 :PetalLength => 0.51551371534272
  :PetalWidth => 0.3981451261378913
 :SepalLength => 0.06998182233047624
  :SepalWidth => 0.01635933618891258

Predictions:

MLJ.predict(model, X)[1:5]
5-element CategoricalDistributions.UnivariateFiniteVector{Multiclass{3}, String, UInt8, Float64}:
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)

4.2 Regression

R:

library(ranger)
X = iris[,2:4]                        # predictors
data = cbind(iris[,1,drop=FALSE], X)  # response (Sepal.Length) + predictors

rf = ranger(Sepal.Length~., data = data, importance = "impurity")

Show feature importances:

importance(rf)
 Sepal.Width Petal.Length  Petal.Width 
    11.72733     46.86181     37.13289 

Make predictions:

head(predict(rf, data = data)$predictions, n = 3)
[1] 5.104768 4.774441 4.649346
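As a quick quality check for the regression forest, ranger stores the out-of-bag mean squared error and R² in the fitted object (the exact values vary between runs because of the random bagging):

rf$prediction.error  # out-of-bag MSE
rf$r.squared         # out-of-bag R²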
Python:

from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets
from sklearn.preprocessing import scale

iris = datasets.load_iris()
data = iris.data
X = scale(data[:,1:4])  # predictors: sepal width, petal length, petal width (standardized)
Y = data[:,0]           # response: sepal length

model = RandomForestRegressor().fit(X, Y)

Feature importance:

print(model.feature_importances_)
[0.07991512 0.85830207 0.06178281]

Make predictions:

model.predict(X)[0:10]
array([5.106     , 4.8205    , 4.57298571, 4.76945   , 5.017     ,
       5.429     , 4.80283333, 5.06201667, 4.5855    , 4.856     ])
Julia:

import StatsBase;
using MLJ;
RF_regressor = @load RandomForestRegressor pkg=DecisionTree;
using RDatasets;
using DataFrames;
iris = dataset("datasets", "iris");
X = mapcols(StatsBase.zscore, iris[:, 2:4]);
Y = iris[:, 1];

Model:

model = fit!(machine(RF_regressor(), X, Y))
trained Machine; caches model-specific representations of data
  model: RandomForestRegressor(max_depth = -1, …)
  args: 
    1:  Source @316 ⏎ Table{AbstractVector{Continuous}}
    2:  Source @129 ⏎ AbstractVector{Continuous}

Feature importance:

feature_importances(model)
3-element Vector{Pair{Symbol, Float64}}:
 :PetalLength => 0.6626304609310221
  :PetalWidth => 0.23647943010293143
  :SepalWidth => 0.10089010896604662

Predictions:

MLJ.predict(model, X)[1:5]
5-element Vector{Float64}:
 5.1000000000000005
 4.659999999999999
 4.62
 4.720000000000001
 5.0600000000000005