Different results for Random Forest Regression in R and Python

Question

I am using the same data to fit a Random Forest regression in R and in Python, but I am getting very different R2 values. I understand that hyperparameters might be one reason, but I don't think they would almost halve the R2 score. I am using the following code and getting the following results.

In Python -

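    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # rmse was not defined in the original post; assumed to be root mean squared error
    def rmse(y_true, y_pred):
        return np.sqrt(mean_squared_error(y_true, y_pred))

    # `data` is a pandas DataFrame with a 'response' column (loaded elsewhere)
    X = data.drop(['response'], axis=1)
    y = data['response']

    # Hold out 5% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=42)

    rdf = RandomForestRegressor(n_estimators=500, oob_score=True)
    rdf.fit(X_train, y_train)

    # score() returns R2; multiplied by 100 to report a percentage
    print("Random Forest Model Score (on Train)", ":", rdf.score(X_train, y_train)*100, ",",
          "Random Forest Model Score (on Test)", ":", rdf.score(X_test, y_test)*100)

    y_predicted = rdf.predict(X_train)
    y_test_predicted = rdf.predict(X_test)

    print("Training RMSE", ":", rmse(y_train, y_predicted),
          "Testing RMSE", ":", rmse(y_test, y_test_predicted))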


>Random Forest Model Score (on Train) : 92.2312123 , Random Forest Model Score (on Test) : 78.1812321

>Training RMSE : 5.606443558164292e-06   Testing RMSE : 9.59221499904858e-06

In R -

    > rows <- sample(0.95*nrow(data))
    > train_random <- data[rows,]
    > test_random <-  data[-rows,]

    > rf_model <- randomForest(response ~ . ,
                             data = train_random,
                             keep.forest=TRUE,
                             importance=TRUE
                             )

    > rf_model

    Call:
     randomForest(formula = response ~ ., data = train_random, keep.forest = TRUE, importance = TRUE) 
                   Type of random forest: regression
                         Number of trees: 500
    No. of variables tried at each split: 6

              Mean of squared residuals: 1.437236e-06
                        % Var explained: 42.05
    > pred_train <- predict(rf_model,train_random)
    > pred_test <- predict(rf_model,test_random)
    > R2_Score(pred_train, train_random$response)
    [1] 0.9014311
    > R2_Score(pred_test, test_random$response)
    [1] 0.3616823

I understand that the train/test split does not produce the same partitions in both languages, but why am I getting such different R2 values, and how can I run the same Random Forest in R? I have tried using the same hyperparameters that I use in Python, but that does not give me the same R2 values in R. Can someone please help me?
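One thing that may be worth checking in the R snippet is the split itself. When `sample()` is given a single number, it returns a random permutation of `1..floor(x)`, so `sample(0.95*nrow(data))` only shuffles the first 95% of the row indices, and `data[-rows,]` is then always the last 5% of the rows rather than a random 5%. If the data has any ordering, that test set may not resemble the training set. The sketch below mimics this behaviour in plain Python; `r_style_sample` and the row count of 200 are hypothetical, chosen only for illustration.

```python
import random

def r_style_sample(x, rng):
    """Mimic R's sample(x) for a single number x >= 1:
    a random permutation of 1..floor(x)."""
    idx = list(range(1, int(x) + 1))
    rng.shuffle(idx)
    return idx

n_rows = 200                      # hypothetical row count, for illustration only
rng = random.Random(0)
all_rows = set(range(1, n_rows + 1))

# What the R snippet does: sample(0.95 * nrow(data)), then data[-rows, ]
rows = r_style_sample(0.95 * n_rows, rng)
test_rows = sorted(all_rows - set(rows))

# A random 95/5 split instead, analogous to sample(nrow(data), 0.95 * nrow(data))
n_train = len(rows)
random_rows = rng.sample(range(1, n_rows + 1), n_train)
random_test_rows = sorted(all_rows - set(random_rows))

print("test rows, R-style split:", test_rows)          # always the tail of the data
print("test rows, random split: ", random_test_rows)   # spread across the data
```

Python's `train_test_split` shuffles all rows before splitting by default, so the two snippets in the question may be evaluating on quite different kinds of test sets.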

Dev · Answer 1 · Apr 12, 2022

Random forests, as others have mentioned, have a random component, which you probably already know about.

In particular, random forest uses bootstrapping, so the fitted model changes each time it is run. I had the same issue with the randomForest function returning different numbers on successive runs. As Zach said, the algorithm draws random subsets of the data, so the final results may differ noticeably between runs. To get around this, I ran set.seed(500) before each new run to reset the seed to 500, and it gave me exactly the same results every time. I hope this is useful.
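To illustrate the seeding idea on the Python side, here is a minimal sketch using only the standard library: a bootstrap resample drawn with a fixed seed comes out identical on every run. The helper `bootstrap_indices` is hypothetical and is not part of randomForest or scikit-learn. In scikit-learn, the analogous knob is the `random_state` parameter of `RandomForestRegressor`; the Python code in the question sets it for `train_test_split` but not for the forest.

```python
import random

def bootstrap_indices(n_rows, seed):
    """Draw one bootstrap sample (with replacement) of row indices,
    using a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(n_rows)]

first = bootstrap_indices(10, seed=500)
second = bootstrap_indices(10, seed=500)   # same seed, so the same resample
other = bootstrap_indices(10, seed=501)    # different seed, very likely a different resample

print("same seed gives same resample:", first == second)
print("different seed gives same resample:", first == other)
```

Fixing the seed makes each implementation repeatable from run to run. It does not make R and Python draw the same random numbers, so it would not be expected to make the two R2 values match each other.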
