pmareport package

Submodules

pmareport.predictors module

The model used to predict appointment duration is a decision tree. The model is evaluated by the percentage of predicted times that fall within a threshold (5 minutes by default) of the actual duration.

The class DurationPredictor splits the data into training and test sets, builds the model (using scikit-learn’s decision tree implementation), and evaluates the model both with cross-validation on the training set and on the held-out test set.

DurationPredictor also includes functionality to turn non-integer categorical features into ints, which scikit-learn’s decision tree implementation requires.

class pmareport.predictors.DurationPredictor(df, feat_cols, response_col)

Bases: object

A model to predict the duration of an appointment.

For example, let’s make a dataframe with random data in columns feat1 and response.

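>>> import numpy as np
>>> import pandas as pd
>>> from pmareport.predictors import DurationPredictor
>>> df = pd.DataFrame(np.random.randn(30, 2), columns=['feat1', 'response'])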

We add a column feat2 with categorical values (‘a’ or ‘b’).

>>> df['feat2'] = np.random.choice(['a', 'b'], 30)

Let’s make a DurationPredictor object from our example dataframe.

>>> dec_pred = DurationPredictor(
...     df=df,
...     feat_cols=['feat1', 'feat2'],
...     response_col='response'
...     )

To turn feat2 into a column of ints (which scikit-learn’s decision tree implementation requires), we use make_int.

>>> dec_pred.make_int(col='feat2')

We split the data set into training and test sets, holding out 10% of the rows for testing.

>>> dec_pred.train_test(test_size=0.1)

Now let’s make the model, a decision tree of maximum depth 3, and get its average score over a 10-fold cross-validation of the training set. The score is the percentage of predictions within 5 minutes of the actual value.

>>> dec_pred.make_model(max_depth=3)
>>> cv_score = dec_pred.cv_evalution(thresh=5)
>>> cv_score >= 0 and cv_score <= 100
True

Fit the model on the full training set and evaluate it on the test set.

>>> test_score = dec_pred.fit()
>>> test_score >= 0 and test_score <= 100
True

Parameters:	df (dataframe) – the data
	feat_cols (list) – a list of the names of the feature columns
	response_col (str) – the name of the response column

cv_evalution(n_folds=10, thresh=5)

Evaluate the model with n_folds-fold cross-validation on the training data. The metric is the percentage of predictions within thresh of the true value.

Parameters:	n_folds (int) – the number of folds for the cross-validation
	thresh (float) – the threshold for considering a prediction close to the true value
Returns:	the average of the metric values over the folds
Return type:	float
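
As an illustration of the evaluation described above, here is a minimal sketch of k-fold cross-validation scored by the percent-within-threshold metric. It is not the package's implementation: the function name cv_percent_within, the use of DecisionTreeRegressor, the shuffled KFold split, and the inclusive threshold are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor


def cv_percent_within(X, y, n_folds=10, thresh=5, max_depth=3):
    """Average percent of predictions within `thresh` over k folds.

    Hypothetical helper: a regression tree and a shuffled KFold split
    are assumed here, not taken from the pmareport source.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in folds.split(X):
        model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        # Share of validation predictions within `thresh`, as a percentage.
        scores.append(100.0 * np.mean(np.abs(pred - y[val_idx]) <= thresh))
    return float(np.mean(scores))


# Synthetic data standing in for appointment features and durations.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 30 + 10 * X[:, 0] + rng.normal(scale=2, size=60)
cv_score = cv_percent_within(X, y, n_folds=10, thresh=5)
```

Averaging per-fold scores gives every fold equal weight, which matches the documented return value ("the average of metric values over the folds").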

fit(thresh=5)

Fit the model on the training set and evaluate it on the test set. The metric is the percent of predictions within thresh of the true value.

Parameters:	thresh (float) – the threshold for considering a prediction close to the true value
Returns:	the score of the model on the test set
Return type:	float
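
The fit-then-score step can be sketched as follows. This is an assumption-laden illustration rather than the method's actual body: it uses a DecisionTreeRegressor, a simple slice in place of the stored train/test split, and an inclusive threshold.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; the first 54 rows play the role of the training set
# and the last 6 rows the held-out test set (an assumption for brevity).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = 30 + 10 * X[:, 0] + rng.normal(scale=2, size=60)
X_train, X_test = X[:54], X[54:]
y_train, y_test = y[:54], y[54:]

# Fit on the full training set only.
model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Score on the test set: percent of predictions within 5 of the truth.
thresh = 5
errors = np.abs(model.predict(X_test) - y_test)
test_score = float(100.0 * np.mean(errors <= thresh))
```

The test rows take no part in fitting, so test_score estimates how the model would do on unseen appointments.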

make_int(col)

Encode a categorical column whose values are not ints as ints, so that it can be used as input to the decision tree.

Parameters:	col (str) – the name of the column with categorical values
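
One common way to do this kind of encoding is pandas.factorize, sketched below. Whether make_int uses factorize, scikit-learn's LabelEncoder, or something else is not stated here, so treat the approach as an assumption.

```python
import pandas as pd

df = pd.DataFrame({"feat2": ["a", "b", "b", "a", "c"]})

# factorize assigns integer codes in order of first appearance and
# returns the unique values so the mapping can be reversed later.
codes, categories = pd.factorize(df["feat2"])
df["feat2"] = codes
```

Note that integer codes impose an arbitrary ordering on the categories; a tree can still split on them, but the splits depend on that ordering.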

make_model(max_depth=3)

Make the model, a decision tree with maximum depth max_depth.

Parameters:	max_depth – the maximum depth of the decision tree
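
To show what the depth limit does, the following sketch builds a scikit-learn tree with max_depth=3 and inspects it after fitting. The choice of DecisionTreeRegressor (rather than a classifier) is an assumption based on the continuous response.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed model class; max_depth caps the number of successive splits.
model = DecisionTreeRegressor(max_depth=3, random_state=0)

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
model.fit(X, y)

depth = model.get_depth()        # never exceeds max_depth
n_leaves = model.get_n_leaves()  # at most 2 ** max_depth
```

A shallow tree can only produce a handful of distinct predictions (at most 8 here), which limits overfitting at the cost of coarser estimates.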

train_test(test_size=0.1)

Split the data into training and test sets.

Parameters:	test_size (float) – the fraction of rows to hold out as the test set (0.1 means 10%)
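
A split of this kind can be sketched with scikit-learn's train_test_split; whether train_test calls it internally is an assumption. With 30 rows and test_size=0.1, three rows are held out.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 2)), columns=["feat1", "response"])

# test_size is a fraction of the rows; random_state makes the split repeatable.
train_df, test_df = train_test_split(df, test_size=0.1, random_state=0)
```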

pmareport.predictors.percent_within(y_true, y_pred, thresh=5)

Calculate the percentage of predictions that are within thresh of the true value.

Parameters:	y_true (array-like) – the true values
	y_pred (array-like) – the predicted values
	thresh (float) – the threshold for a close prediction
Returns:	the percentage of predictions within the threshold of the true value
Return type:	float
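
The metric is simple enough to work through by hand. The sketch below is a stand-in named percent_within_sketch, not the package function; whether the boundary is inclusive (<=) or strict (<) is an assumption.

```python
import numpy as np


def percent_within_sketch(y_true, y_pred, thresh=5):
    """Percentage (0 to 100) of predictions within `thresh` of the truth."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Inclusive comparison is an assumption made for this sketch.
    close = np.abs(y_true - y_pred) <= thresh
    return float(100.0 * close.mean())


actual = [30, 45, 20, 60]
predicted = [33, 52, 20, 54]
# Absolute errors are 3, 7, 0 and 6, so two of the four are within 5.
score = percent_within_sketch(actual, predicted, thresh=5)
```

Here score is 50.0, since half of the predictions land within 5 minutes of the actual duration.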

pmareport.predictors.read_data(fp='../data/pmadata.csv')

Read clinic data from a csv into a pandas dataframe.

Parameters:	fp (str) – the file path of the csv file
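
Reading a csv into a dataframe is typically a single pandas.read_csv call, as in this sketch. The in-memory text and the column names appt_id and duration_min are hypothetical stand-ins; the real layout of pmadata.csv is not documented here.

```python
import io

import pandas as pd

# Hypothetical csv content used in place of the real data file.
csv_text = "appt_id,duration_min\n1,30\n2,45\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
```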
