pmareport package
Submodules
pmareport.predictors module
The model used to predict appointment duration is a decision tree. The model is evaluated by the percentage of predicted times that are within a threshold (5 minutes by default) of the actual duration.
The class DurationPredictor splits the data into training and test sets, builds the model (using scikit-learn’s decision tree implementation), and evaluates it both by cross-validation on the training set and on the held-out test set.
DurationPredictor also includes functionality to turn non-integer categorical features into ints, which scikit-learn’s decision tree implementation requires.
class pmareport.predictors.DurationPredictor(df, feat_cols, response_col)
Bases: object
A model to predict the duration of an appointment.
For example, let’s make a dataframe with random data in columns feat1 and response.
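>>> import numpy as np
>>> import pandas as pd
>>> from pmareport.predictors import DurationPredictor
>>> df = pd.DataFrame(np.random.randn(30, 2), columns=['feat1', 'response'])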
We add a column feat2 with categorical values (‘a’ or ‘b’).
>>> df['feat2'] = np.random.choice(['a', 'b'], 30)
Let’s make a DurationPredictor object from our example dataframe.
>>> dec_pred = DurationPredictor(
...     df=df,
...     feat_cols=['feat1', 'feat2'],
...     response_col='response'
... )
To turn feat2 into a column of ints (which scikit-learn’s decision tree implementation requires), we use make_int.
>>> dec_pred.make_int(col='feat2')
We split the data set into training and test sets, holding out 10% for testing.
>>> dec_pred.train_test(test_size=0.1)
Now let’s make the model, a decision tree of maximum depth 3, and get its average score over a 10-fold cross-validation split. The score is the percentage of predictions within 5 minutes of the actual value.
>>> dec_pred.make_model(max_depth=3)
>>> cv_score = dec_pred.cv_evalution(thresh=5)
>>> cv_score >= 0 and cv_score <= 100
True
Fit the model on the full training set and evaluate it on the test set.
>>> test_score = dec_pred.fit()
>>> test_score >= 0 and test_score <= 100
True
Parameters:
- df (dataframe) – the data
- feat_cols (list) – a list of the names of the feature columns
- response_col (str) – the name of the response column
cv_evalution(n_folds=10, thresh=5)
Evaluate the model by cross-validation on the training data, using n_folds folds. The metric is the percentage of predictions within thresh of the true value.
Parameters:
- n_folds (int) – the number of folds for the cross validation
- thresh (float) – the threshold for considering a prediction close to the true value
Returns: the average of metric values over the folds
Return type: float
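The cross-validated evaluation described above can be approximated directly with scikit-learn. This is a minimal sketch, not the package's implementation: it assumes a DecisionTreeRegressor (the source only says "decision tree"), synthetic data, and a hypothetical helper percent_within_sketch standing in for the metric.

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def percent_within_sketch(y_true, y_pred, thresh=5):
    """Hypothetical stand-in metric: percent of predictions within thresh."""
    close = np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= thresh
    return 100.0 * float(np.mean(close))


# Synthetic training data (assumption: durations in minutes).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 30 + 10 * X[:, 0] + rng.normal(size=100)

# Assumption: a regression tree, since the response is a duration.
model = DecisionTreeRegressor(max_depth=3, random_state=0)

# make_scorer forwards thresh to the metric on every fold.
scorer = make_scorer(percent_within_sketch, thresh=5)
fold_scores = cross_val_score(model, X, y, cv=10, scoring=scorer)

# The reported value is the mean of the per-fold scores.
cv_score = float(fold_scores.mean())
```

Averaging the per-fold scores mirrors the documented return value ("the average of metric values over the folds").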
fit(thresh=5)
Fit the model on the training set and evaluate it on the test set. The metric is the percent of predictions within thresh of the true value.
Parameters: thresh (float) – the threshold for considering a prediction close to the true value
Returns: the score of the model on the test set
Return type: float
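As a rough illustration of the fit-then-score step, the following sketch holds out 10% of a synthetic data set, fits a tree on the rest, and scores the held-out predictions. The use of train_test_split and DecisionTreeRegressor is an assumption; the source does not show how DurationPredictor does this internally.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption: durations in minutes).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 30 + 10 * X[:, 0] + rng.normal(size=200)

# Hold out 10% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Assumption: a regression tree of maximum depth 3.
model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Percent of test predictions within 5 units of the true value.
y_pred = model.predict(X_test)
test_score = 100.0 * float(np.mean(np.abs(y_test - y_pred) <= 5))
```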
make_int(col)
Encode categorical variables of type other than int as ints for input into the decision tree.
Parameters: col (str) – the name of the column with categorical values
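One common way to turn a string-valued column into ints is pandas.factorize, sketched below. The source does not say which encoding make_int uses, so this is a stand-in under that assumption, not a description of the method.

```python
import pandas as pd

df = pd.DataFrame({"feat2": ["a", "b", "b", "a", "c"]})

# factorize assigns integer codes in order of first appearance and
# returns the distinct values so the mapping can be inverted later.
codes, categories = pd.factorize(df["feat2"])
df["feat2"] = codes
```

Keeping categories around lets you map the integer codes back to the original labels when interpreting the fitted tree.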
pmareport.predictors.percent_within(y_true, y_pred, thresh=5)
Calculate the percentage of predictions that are within thresh of the true value.
Parameters:
- y_true (array-like) – the true values
- y_pred (array-like) – the predicted values
- thresh (float) – the threshold for a close prediction
Returns: the percent of predictions within the threshold of the true value
Return type: float
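The metric itself is simple enough to sketch with NumPy. The function name percent_within_sketch is hypothetical, and the inclusive comparison (<=) is an assumption, since the source does not say whether a prediction exactly thresh away counts as close.

```python
import numpy as np


def percent_within_sketch(y_true, y_pred, thresh=5):
    """Percentage (0 to 100) of predictions within thresh of the true value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: the boundary case |error| == thresh counts as close.
    close = np.abs(y_true - y_pred) <= thresh
    return 100.0 * float(close.mean())


# Absolute errors are 2, 7, 0 and 6, so two of four predictions are close.
score = percent_within_sketch([10, 20, 30, 40], [12, 27, 30, 34])
```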