ORDINAL CLASSIFICATION TREES: METHODS AND APPLICATION
LUGO, ALESSANDRA
2015
Abstract
During my PhD in Biomedical Statistics, I focused my research on ordinal classification tree methodology, to be applied when the outcome is an ordinal categorical variable, and on the statistical and computational methods related to it. I then applied this methodology to the set of controls of Italian case-control studies, in order to identify the profiles of individuals according to their total energy intake and to their consumption of red meat and processed meat.

Introduction to classification trees

Tree-based methods are nonparametric regression methods belonging to a group of techniques called “recursive partitioning”. They recursively partition the feature space, which includes all the predictors, into a set of nested rectangular areas. The main objective of this technique is to obtain subgroups of observations (nodes) that are as homogeneous as possible in terms of the values of the response variable. Node homogeneity is quantified by the notion of node purity: in a completely pure node, all the observations belong to the same category of the outcome. When the response to be predicted is ordinal, an ordinal impurity measure is usually preferred. The most frequently used ordinal impurity function is the generalized Gini function. Among all the possible binary splits for a given node, the one resulting in the largest decrease in node impurity is selected. Splitting continues in each node until some stopping condition is reached, and a large tree T0 is built.

However, a very large tree may overfit the data. Overfitting means that a classifier adapts too closely to the training dataset, leading to poor performance when applied to the validation set. Thus, a common strategy is to eliminate those parts of a classifier that are likely to overfit the training data.
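The split-selection rule described in the introduction above can be illustrated with a minimal sketch. It assumes the usual form of the generalized Gini impurity, i(t) = Σ_k Σ_l C(k|l) p(k|t) p(l|t), with ordinal classes coded as integer scores and C either the absolute or the squared distance between scores. The abstract does not state the exact definition used in the thesis, so the formula, the function names (`generalized_gini`, `best_split`, `absolute_cost`, `quadratic_cost`) and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def absolute_cost(k, l):
    """Cost of confusing class k with class l: absolute distance between scores."""
    return abs(k - l)


def quadratic_cost(k, l):
    """Cost of confusing class k with class l: squared distance between scores."""
    return (k - l) ** 2


def generalized_gini(labels, cost):
    """Generalized Gini impurity: sum over class pairs of cost(k, l) * p(k) * p(l)."""
    n = len(labels)
    if n == 0:
        return 0.0
    props = {k: c / n for k, c in Counter(labels).items()}
    return sum(cost(k, l) * pk * pl
               for k, pk in props.items()
               for l, pl in props.items())


def best_split(x, y, cost):
    """Return (threshold, impurity decrease) for the best split 'x <= threshold'.

    Candidate thresholds are midpoints between consecutive distinct values of x.
    Returns (None, 0.0) when no split decreases the impurity.
    """
    n = len(y)
    parent = generalized_gini(y, cost)
    best = (None, 0.0)
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= thr]
        right = [yi for xi, yi in zip(x, y) if xi > thr]
        decrease = (parent
                    - len(left) / n * generalized_gini(left, cost)
                    - len(right) / n * generalized_gini(right, cost))
        if decrease > best[1]:
            best = (thr, decrease)
    return best


# Toy data (invented): one predictor and an ordinal outcome with classes 1 < 2 < 3.
x = [1, 2, 3, 4, 5, 6]
y = [1, 1, 2, 2, 3, 3]
print(best_split(x, y, absolute_cost))
print(best_split(x, y, quadratic_cost))
```

With the quadratic cost, confusing distant classes is penalized more heavily than confusing adjacent ones, which is what distinguishes the ordinal impurity from the nominal Gini index.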
This elimination process is called pruning, and it consists of removing branches that do not add information to prediction accuracy. Classification tree analysis uses cost-complexity pruning. This approach balances the complexity (i.e., the number of predictors and terminal nodes) of the sub-tree against the overall misclassification rate. Two predictive performance measures that can be used when Y is an ordinal outcome are the total number of misclassified observations R_mr(T) and the total misclassification cost R_mc(T). With the first, the class assigned to the observations within each node is usually the modal class of the outcome; with the second, it is the median class.

Classification trees make it possible to examine complex interactions among risk factors without specifying them a priori. Moreover, they are likely to identify the most important risk (or protective) factors among various predictors, and they can identify ideal cut-offs of continuous variables according to some pre-specified criteria. On the other hand, since this is a data-driven method, a drawback is that small changes in the data can result in a very different series of splits, making interpretation unstable.

Application to real data

To conduct the analyses with the ordinal classification tree method, I used the set of controls (n=7750) of various case-control studies conducted in six Italian provinces between 1991 and 2008. Controls were individuals with no history of cancer, admitted to the same hospitals as the cases for acute, non-neoplastic conditions unrelated to diseases or conditions linked to the cancer under study. Predictors were food groups, related to the subjects' dietary habits during the 2 years before hospitalization, assessed through a validated and reproducible food frequency questionnaire, which included information on weekly consumption of 78 foods and beverages.
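As a concrete reading of the two predictive performance measures introduced earlier in this section, the sketch below computes R_mr(T) and R_mc(T) for a tree represented simply as a list of terminal nodes, each holding the observed ordinal classes of its observations. The representation, the function names, the tie-breaking rules and the use of the absolute distance between class scores as the default cost are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter


def modal_class(labels):
    """Most frequent class in a node; ties go to the lowest class (an assumption)."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(k for k, c in counts.items() if c == top)


def median_class(labels):
    """Lower median of the node's classes, so the result is always an observed class."""
    ordered = sorted(labels)
    return ordered[(len(ordered) - 1) // 2]


def misclassified_count(nodes):
    """R_mr(T): observations whose class differs from the modal class of their node."""
    return sum(sum(1 for y in labels if y != modal_class(labels))
               for labels in nodes)


def misclassification_cost(nodes, cost=lambda k, l: abs(k - l)):
    """R_mc(T): summed cost of assigning each observation the median class of its node."""
    total = 0
    for labels in nodes:
        assigned = median_class(labels)
        total += sum(cost(assigned, y) for y in labels)
    return total


# Toy tree (invented) with two terminal nodes and ordinal classes 1 < 2 < 3 < 4.
nodes = [[1, 1, 1, 2, 2, 3, 3], [3, 3, 4]]
print(misclassified_count(nodes), misclassification_cost(nodes))
```

In the first toy node the modal class is 1 while the median class is 2, which shows how the two assignment rules can disagree on the same node.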
Two different types of analyses were performed to evaluate the performance of the classification tree methodology in predicting the category of total energy intake (kcal/day): single-tree analysis and resampling analysis (B=100, to overcome sampling variability). I compared five different scenarios, four in the context of ordinal classification trees (generalized Gini impurity function) and one in the context of nominal classification trees (Gini impurity measure). In the ordinal context, each scenario was a combination of the splitting function (absolute or quadratic misclassification cost) and the predictive performance measure (misclassification error rate/mode or misclassification cost/median). Classification trees were also fitted to predict the daily consumption (grams) of red meat and processed meat.

Results and Discussion

The most important predictor for energy intake was bread consumption. Indeed, this predictor was the first split in each of the five scenarios, with a threshold of 16.4 portions/week. Other predictors common to all five scenarios were dessert and red meat intake. The comparison between the five methods showed that, for an ordinal outcome, appropriate ordinal methods should be preferred. Comparing prediction accuracy across the ordinal models, those with quadratic misclassification cost had better predictive power, in particular when the median was used to assign outcome classes. Good predictive performance was also observed with quadratic misclassification cost and modal values. These findings were consistent in both the single-tree and the resampling analysis. In the single-tree analysis, the values of Somers’ d ranged between 0.489 and 0.534. In the resampling analysis, Friedman’s test rejected the global equality hypothesis across the five models (p<0.001).
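For readers unfamiliar with Somers’ d, the measure reported above, a minimal sketch of its pairwise definition follows. It computes the asymmetric version D(y | x) = (concordant − discordant) / (pairs not tied on x). Which variable the thesis treated as dependent is not stated in the abstract, so the direction chosen here, the function name `somers_d` and the toy data are assumptions.

```python
from itertools import combinations


def somers_d(x, y):
    """Somers' D(y | x): (concordant - discordant) / number of pairs not tied on x."""
    concordant = discordant = untied_x = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pairs tied on x are excluded from the denominator
        untied_x += 1
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # pairs tied on y only stay in the denominator but add nothing to the numerator
    if untied_x == 0:
        raise ValueError("all pairs are tied on x; Somers' d is undefined")
    return (concordant - discordant) / untied_x


# Toy data (invented): predicted and observed ordinal classes for six subjects.
predicted = [1, 1, 2, 2, 3, 3]
observed = [1, 2, 2, 3, 2, 3]
print(somers_d(predicted, observed))
```

The quadratic loop over all pairs is fine for a small illustration; for a dataset of the size used in the thesis, a library implementation would be the more practical choice.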
In the application to red meat and processed meat intake, important predictors for red meat consumption were total intake of sweets (first split) and bread consumption, while important predictors for processed meat intake were the consumption of eggs, bread and sweets. In particular, subjects eating less than 1 egg per week and less than 2 portions of bread per day were classified as having a low consumption (<25 g/day) of processed meat. On the other hand, individuals eating more than 1 egg per week and more than 30 portions of sweets per week were predicted to have a high (≥50 g/day) consumption of processed meat.

Future research should take advantage of the findings obtained with classification tree methodologies to investigate the relationship between red meat and processed meat intake and the risk of colorectal cancer and of other neoplasms. Moreover, the application of recursive partitioning techniques in predictive settings, including data on cancer screening, may be of interest for future research.

File | Size | Format | |
---|---|---|---|
phd_unimi_R10100.pdf (Open Access from 10/03/2016) | 2.28 MB | Adobe PDF | |
Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/20.500.14242/78969
URN:NBN:IT:UNIMI-78969