pmareport package

Submodules

pmareport.predictors module

The model used to predict appointment duration is a decision tree. The model is evaluated by the percentage of predicted durations that fall within a threshold (5 minutes by default) of the actual duration.

The class DurationPredictor splits the data into training and test sets, builds the model (using scikit-learn’s decision tree implementation), and evaluates it both on a cross-validation split of the training set and on the test set.

DurationPredictor also includes functionality to turn non-integer categorical features into ints, which scikit-learn’s decision tree implementation requires.

class pmareport.predictors.DurationPredictor(df, feat_cols, response_col)[source]

Bases: object

A model to predict the duration of an appointment.

For example, let’s make a dataframe with random data in columns feat1 and response.

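>>> import numpy as np
>>> import pandas as pd
>>> from pmareport.predictors import DurationPredictor
>>> df = pd.DataFrame(np.random.randn(30,2), columns=['feat1', 'response'])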

We add a column feat2 with categorical values (‘a’ or ‘b’).

>>> df['feat2'] = np.random.choice(['a', 'b'], 30)

Let’s make a DurationPredictor object from our example dataframe.

>>> dec_pred = DurationPredictor(
...     df=df,
...     feat_cols=['feat1', 'feat2'],
...     response_col='response'
...     )

To turn feat2 into a column of ints (which scikit-learn’s decision tree implementation requires), we use make_int.

>>> dec_pred.make_int(col='feat2')

We split our data set into train and test with 10% left out to test.

>>> dec_pred.train_test(test_size=0.1)

Now let’s make the model, a decision tree of maximum depth 3, and get its average score over a 10-fold cross-validation split. The score is the percentage of predictions within 5 minutes of the actual value.

>>> dec_pred.make_model(max_depth=3)
>>> cv_score = dec_pred.cv_evalution(thresh=5)
>>> cv_score >= 0 and cv_score <= 100
True

Fit the model on the full training set and evaluate it on the test set.

>>> test_score = dec_pred.fit()
>>> test_score >= 0 and test_score <= 100
True
Parameters:
  • df (dataframe) – the data
  • feat_cols (list) – a list of the names of the feature columns
  • response_col (str) – the name of the response column
cv_evalution(n_folds=10, thresh=5)[source]

Evaluate the model on a cross-validation split of the training data with n_folds folds. The metric is the percentage of predictions within thresh of the true value.

Parameters:
  • n_folds (int) – the number of folds for the cross validation
  • thresh (float) – the threshold for considering a prediction close to the true value
Returns:

the average of metric values over the folds

Return type:

float
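
The cross-validated metric described above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: it assumes the model is scikit-learn's DecisionTreeRegressor (the source only says "decision tree") and that "within thresh" is inclusive. The names percent_close and cv_percent_within and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Made-up data: two numeric features and a duration in minutes.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 20 + 3 * X[:, 0] + rng.normal(0, 2, size=100)


def percent_close(y_true, y_pred, thresh=5):
    """Percentage (0 to 100) of predictions within thresh of the truth."""
    return 100.0 * np.mean(np.abs(y_true - y_pred) <= thresh)


def cv_percent_within(X, y, n_folds=10, thresh=5, max_depth=3):
    """Average the metric over n_folds cross-validation folds."""
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in folds.split(X):
        # Assumption: a regression tree; refit from scratch on each fold.
        model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[val_idx])
        scores.append(percent_close(y[val_idx], y_pred, thresh))
    return float(np.mean(scores))


cv_score = cv_percent_within(X, y)
```

Refitting a fresh tree inside each fold keeps the held-out fold unseen during training, so the averaged score reflects out-of-sample accuracy.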

fit(thresh=5)[source]

Fit the model on the training set and evaluate it on the test set. The metric is the percent of predictions within thresh of the true value.

Parameters:thresh (float) – the threshold for considering a prediction close to the true value
Returns:the score of the model on the test set
Return type:float
make_int(col)[source]

Encode categorical variables of type other than int as ints for input into the decision tree.

Parameters:col (str) – the name of the column with categorical values
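
One common way to perform this kind of encoding is pandas.factorize, which maps each distinct value to an integer code. The snippet below is a sketch under that assumption; the source does not say how make_int encodes values, and the example dataframe is made up.

```python
import pandas as pd

# Hypothetical column of string categories.
df = pd.DataFrame({"feat2": ["a", "b", "b", "a", "c"]})

# factorize assigns codes in order of first appearance and also
# returns the unique values, so the mapping can be inverted later.
codes, uniques = pd.factorize(df["feat2"])
df["feat2"] = codes
```

Note that integer codes impose an arbitrary ordering on the categories, which a decision tree will treat as numeric when choosing split points.
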
make_model(max_depth=3)[source]

Make the model, a decision tree with maximum depth max_depth.

Parameters:max_depth (int) – the maximum depth of the decision tree
train_test(test_size=0.1)[source]

Split the data into train and test sets.

Parameters:test_size (float) – the fraction of rows to hold out as the test set (0.1 holds out 10%)
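
For illustration, a split like this can be made with scikit-learn's train_test_split. This is a sketch only; the source does not state which function train_test uses internally, and the dataframe and random_state below are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up dataframe with 30 rows, mirroring the class example.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((30, 2)), columns=["feat1", "response"])

# test_size is a fraction: 0.1 holds out 10% of the rows.
train_df, test_df = train_test_split(df, test_size=0.1, random_state=0)
```

Fixing random_state makes the split reproducible, which helps when comparing scores across runs.
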
pmareport.predictors.percent_within(y_true, y_pred, thresh=5)[source]

Calculate the percentage of predictions that are within thresh of the true value.

Parameters:
  • y_true (array-like) – the true values
  • y_pred (array-like) – the predicted values
  • thresh (float) – the threshold for a close prediction
Returns:

the percentage of predictions within the threshold of the true value

Return type:

float
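
A minimal sketch of such a metric with NumPy is shown below. It assumes the comparison is inclusive (an absolute error exactly equal to thresh counts as close) and that the result is on a 0 to 100 scale, as the class example suggests; the function name percent_within_sketch is hypothetical and this is not the package's own code.

```python
import numpy as np


def percent_within_sketch(y_true, y_pred, thresh=5):
    """Percentage (0 to 100) of predictions within thresh of the truth."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: inclusive comparison; the source does not specify.
    close = np.abs(y_true - y_pred) <= thresh
    return 100.0 * close.mean()


# Absolute errors are 3, 7, 0 and 6 minutes, so two of four are within 5.
example = percent_within_sketch([30, 45, 60, 20], [33, 52, 60, 14])
```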

pmareport.predictors.read_data(fp='../data/pmadata.csv')[source]

Read clinic data from a csv into a pandas dataframe.

Parameters:fp (str) – the file path of the csv file

Module contents