Introduction: Multiple scales have been developed to evaluate breast cosmesis following breast conserving treatment (BCT); however, their reliability remains a problem. Panel scores, in which scores from two or more evaluators are combined, were assessed to examine their effect on reliability for two different cosmetic scales.
Methods: Women who were two or more years post-BCT were recruited from a single breast centre. Photographs of each participant were evaluated independently by six health care professionals on two separate occasions. A simple four-point scale and a more involved multi-item scale were used to assess cosmetic outcome. Reliability was assessed with the weighted kappa statistic for increasing panel sizes.
Results: Ninety-nine women were evaluated. With increasing panel size, intra-rater reliability increased from 0.73 to 0.83 for the four-point scale; however, the 95% confidence intervals generally overlapped. A smaller and more unpredictable effect was seen on the multi-item subscale (range 0.69 to 0.73). With increasing panel size, inter-rater reliability increased from 0.68 to 0.93 for the four-point scale and from 0.75 to 0.96 for the multi-item scale; the 95% confidence intervals did not overlap. A panel of three for either scale provided almost perfect kappa values, with only small improvements for larger panel sizes.
Conclusions: Care should be taken in interpreting results where cosmetic outcomes have been obtained from a single evaluator. Panel scores can be used to significantly improve inter-rater, but not intra-rater, reliability for the scales studied. Comparable reliability, in combination with simplicity of use and interpretation, would favour the four-point scale over the multi-item scale for breast cosmetic evaluation.
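As an illustration of the reliability analysis described in the Methods, the following is a minimal sketch of how a weighted kappa could be computed between two sets of panel scores. It rests on several assumptions not stated in the abstract: panel scores are formed as the rounded mean of the individual ratings, disagreement weights are linear, and the function names `panel_scores` and `weighted_kappa` are hypothetical. The ratings in the usage lines are made up for demonstration and are not study data.

```python
import math
from statistics import mean


def panel_scores(ratings):
    """Combine ratings into one panel score per participant.

    `ratings` holds one list per rater, each with one score per participant.
    Assumption: the panel score is the mean rounded to the nearest category
    (halves round up); the abstract does not state the combination rule.
    """
    return [math.floor(mean(subject) + 0.5) for subject in zip(*ratings)]


def weighted_kappa(a, b, categories, power=1):
    """Cohen's weighted kappa for two equal-length lists of ordinal scores.

    power=1 gives linear disagreement weights, power=2 quadratic weights.
    Assumption: the weighting scheme used in the study is not specified.
    """
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions of (score in a, score in b).
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed and chance-expected disagreement.
    obs_dis = 0.0
    exp_dis = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            obs_dis += w * observed[i][j]
            exp_dis += w * row[i] * col[j]

    if exp_dis == 0:
        raise ValueError("kappa is undefined when all scores fall in one category")
    return 1.0 - obs_dis / exp_dis


# Made-up ratings: two panels of three raters each scoring four participants
# on a four-point scale (1 to 4).
panel_a = panel_scores([[1, 2, 3, 4], [1, 2, 4, 4], [2, 2, 3, 3]])
panel_b = panel_scores([[1, 3, 3, 4], [1, 2, 3, 4], [2, 2, 4, 4]])
kappa = weighted_kappa(panel_a, panel_b, categories=[1, 2, 3, 4])
```

Repeating this calculation for panels of one, two, three or more raters would show how agreement changes with panel size, in the spirit of the comparison reported in the Results.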