The psychometric properties and treatment outcomes associated with two measures of the adult therapeutic alliance using naturalistic data

Daryl Mahon, Takuya Minami, (G.S.) Jeb Brown

Abstract

This is the first of two articles that examine the psychometric properties and treatment outcomes associated with measures of the therapeutic alliance collected in naturalistic settings rather than clinical trials. These articles break new ground in understanding the relationship between measures of alliance and treatment outcome by performing detailed item analyses in order to assess the psychometric properties of each individual item, its propensity to change from session to session and its utility in predicting change in the outcome measure.

Methods

Data were taken from the ACORN database for adults attending psychotherapy in naturalistic settings (N = 147,399). The sample, the largest to date, included only those completing both an alliance measure and an outcome measure at every session. Two sets of three alliance items are used across three different treatment populations: general outpatient, substance use, and severe and persistent mental health difficulties. The psychometric properties of each item were evaluated, including factor analysis, likelihood of change over time in treatment and correlations between changes in alliance and the magnitude of client-reported improvement in therapy. The predictive validity of each individual item is compared with the predictive validity of its respective three-item measure. The problem of alliance scores being heavily skewed in a positive direction, and the resulting lack of variability from session to session, is addressed by categorising change in alliance as same, better or worse, regardless of the magnitude of change.

Results

Findings tended to differ from those in clinical trials, with the last alliance score being most predictive of outcome. In this data set, the alliance accounted for, at most, 2% of the variance in outcomes. Measures of the therapeutic alliance demonstrate ceiling effects, and the alliance–outcome correlation is far from linear. Change in alliance score, rather than a single assessment, is more predictive of outcome, regardless of the magnitude of change in the alliance measure, with effect sizes up to 50% smaller for those who rated the alliance as worse than for those rating it as unchanged or better.

Conclusion

Therapists using therapeutic alliance questionnaires will benefit from being aware of how the psychometric properties of alliance measures affect their relationship to outcomes. Even the smallest drop in alliance is predictive of clinically meaningful differences in outcome. Implications for practice, training and research are considered.

KEYWORDS: alliance measures, alliance psychometric properties, alliance–outcome correlation, therapeutic alliance, therapeutic relationship, working alliance


1  | INTRODUCTION

The origins of the therapeutic alliance can be traced back to Freud's theory on transference (Horvath & Luborsky, 1993), and it is now firmly established as a common factor variable across theoretical modalities. Indeed, it has extended its reach beyond psychotherapy into other allied health and social care disciplines (Flückiger et al., 2018). While Freud's initial ideas were related to the transferential relationship, his thinking on the alliance evolved to describe a beneficial attachment between therapist and client: ‘even the most brilliant results were liable to be suddenly wiped away if my personal relation with the patient was disturbed … the personal emotional relation between doctor and client was after all stronger than the whole cathartic process’ (Freud, 1927, p. 27). There have been a number of authors proposing definitions of the alliance in the extant literature (e.g., Greenson, 1965; Zetzel, 1956).

Rogers' (1967) client-centred therapy provides a comprehensive psychotherapy theory that places the therapeutic alliance at the heart of the model. For Rogers, three core conditions give the alliance its potency. First, empathy is needed by the therapist in order to step into clients' shoes and understand their subjective experiences. Second, unconditional positive regard cultivates a safe, accepting and non-judgmental relationship in which treatment can occur. Finally, congruence entails the therapist being open and genuine with the client throughout the therapy process.

However, the definition most often used is based on a transtheoretical conceptualisation put forward by Bordin (1979), who defined the alliance as the extent of the agreement on the therapeutic goals, consensus on the tasks that make up therapy and a bond between the client and the practitioner. Indeed, it was Bordin who suggested that different therapies would place different demands on the relationship; thus, the ‘profile’ of the ideal working alliance would differ across orientations (Flückiger et al., 2018). That is, some approaches will place more emphasis on specific aspects of the alliance; for example, in contrast to Rogers, cognitive behavioural therapy will have more of a focus on case conceptualisation, collaborative empiricism and Socratic dialogue (Dobson, 2022; Kazantzis et al., 2017).

Indeed, research suggests that better therapy outcomes are associated with therapists who can adjust their approach to clients' sense of the alliance (Tschuschke et al., 2020; Werbart et al., 2018), while conceptual orientation is not associated with alliance ruptures (Tschuschke et al., 2022).

The therapeutic alliance is one of the most studied variables in the psychotherapy literature (Norcross & Lambert, 2019), with well over 1000 studies demonstrating its effectiveness as a common factor. In addition to this, the alliance is a robust predictor of client outcome (Flückiger et al., 2018, 2020; Horvath et al., 2011; Horvath & Bedi, 2002; Horvath & Symonds, 1991; Martin et al., 2000), contributing more variance to client outcomes than the specific treatment method used (Wampold & Imel, 2015).

In their systematic review, Baier et al. (2020) report that the alliance mediated clinical outcomes in 70% of the included studies, while Flückiger et al. (2018) found negative correlations in only 2% of the studies in their meta-analysis. The alliance is predictive of client dropout (Cooper et al., 2023) and of poor therapy outcomes (Bolsinger et al., 2020), while alliance scores predict approximately 5%–8% of the variance in treatment outcomes for adults (Flückiger et al., 2018; Horvath et al., 2011). This relationship underscores that attending to the alliance during therapy is a critical skill for practitioners to attain. However, just as therapists vary in their effectiveness (Mahon et al., 2023), they also differ in their ability to cultivate and maintain the therapeutic alliance, with more effective therapists able to cultivate the alliance with a wider group of clients (Baldwin et al., 2007; Del Re et al., 2012, 2021). Moreover, it is suggested that therapist variability in the alliance is more important than client variability for improving client outcomes (Baldwin et al., 2007; Del Re et al., 2012; Wampold & Flückiger, 2023).

One reason why maintaining the alliance throughout the course of treatment is often difficult is due to ruptures. Eubanks et al. (2018, p. 508) define an alliance rupture as ‘a deterioration in the therapeutic alliance, manifested by a disagreement between the patient and therapist on treatment goals, a lack of collaboration on therapeutic tasks, or a strain in their emotional bond’.

Client characteristics such as attachment style, personality disorders and motivation for change are correlated with ruptures (Coutinho et al., 2014; Eames & Roth, 2000). The most recent meta-analysis, conducted by Eubanks et al. (2018) with a sample of 11 studies and 1314 participants, found a moderate effect size (d = 0.62) for the association between rupture repair and client outcome. As such, therapists who can tend to ruptures are more likely to have clients who do better in therapy. Flückiger et al.'s (2020) meta-analysis found a reciprocal relationship between the alliance and symptoms, with improvements in alliance scores in one session being predictive of, and associated with, reductions in symptoms in subsequent sessions. Other research tracking session-to-session alliance change notes that alliance change predicts subsequent symptom changes (Falkenström, Ekeblad, et al., 2016; Feeley et al., 1999; Wampold & Imel, 2015).

Research has not established the alliance as a causal factor in therapy outcomes (Goldberg et al., 2023). However, we can say that the alliance–outcome correlation is not due to treatment type (Flückiger et al., 2018), client characteristics (Del Re et al., 2012, 2021; Flückiger et al., 2020) nor exclusively due to prior reductions in distress (Falkenström et al., 2013; Flückiger et al., 2020; Zilcha-Mano et al., 2016). Meta-analyses demonstrate that the alliance–outcome correlation is consistent across a range of psychotherapies (Flückiger et al., 2018; Horvath et al., 2011; Horvath & Symonds, 1991; Martin et al., 2000).

Generally speaking, the alliance–outcome correlation is consistent across diagnoses; however, substance use and eating disorders tend to have smaller effect sizes (Flückiger et al., 2018).

Although some research suggests that those with personality disorders demonstrate difficulties in forming the alliance (Forster et al., 2014), a recent meta-analysis by Flückiger et al. (2018) found no difference in the alliance–outcome association with this population. The alliance–outcome association is also not due to methodological issues, such as who rates the alliance (client, therapist or observer; Flückiger et al., 2018), and recent research using data from naturalistic settings illustrates that the association is not driven by low-alliance outliers (Goldberg et al., 2023). These studies are suggestive of the therapeutic alliance as a universal and pan-theoretical treatment construct across clients, therapists and treatment settings, with the most recent meta-analysis (Flückiger et al., 2018) finding an alliance–outcome correlation of r = .278.


1.1  |  Measuring the alliance

In a review of the use of therapeutic alliance measures, Ardito and Rabellino (2011) suggest that most alliance studies tend to focus on the following measures: the California Psychotherapy Alliance Scale, Helping Alliance Questionnaire, Vanderbilt Psychotherapy Process Scale and the Working Alliance Inventory (WAI). Table 1 provides a breakdown of various alliance measures used in research and practice. However, most measures of the alliance are developed for research purposes, perhaps with the exception of the Session Rating Scale (SRS), which is an ultra-brief clinical session-to-session measure (Duncan et al., 2012).

It is important to note that almost all of these studies contained small sample sizes, and while some may have been replicated in later research, there is a lack of evidence synthesis to help inform decision-making.

Notable is a recent systematic review of the WAI, covering 66 studies, by Paap et al. (2022, p. 1), who found that 'Content validity was rated insufficient because neither patients nor healthcare professionals were involved in the development and validation process'. Hence, evidence for the content validity of the WAI is unknown. Conflicting evidence was found for structural validity. Evidence for internal consistency could not be established. Limited evidence was found for inter-rater reliability and convergent validity. Conflicting evidence was also found for test–retest reliability and divergent validity. Most of the measures in Table 1 are based on the conception of the alliance put forward by Bordin (1979), and their reported psychometric properties are good, especially the alpha coefficients.

Using data from two meta-analyses, Meier and Feeley (2022) suggest that the alliance has a stable ceiling effect on score distributions, and as such, the working alliance does not exhibit the characteristics of a normally distributed variable. Saunders et al. (1989) found that 80% of clients indicated that their session was pretty good, very good, excellent or perfect, while Kim et al. (2001) found 70%–88% of clients gave perfect ratings for most of the items on their measure. These ceiling effects have been demonstrated in other studies (Meier, 2022; Paap et al., 2019; Reese et al., 2013). Although ceiling effects may be indicative of the early establishment of the alliance, others suggest that they could be due to measurement problems and social desirability (Baldwin & Goldberg, 2021; Paap et al., 2019).

The problem of a high degree of skew raises questions about the interpretation of correlational analyses that assume a distribution somewhat close to normal. Clearly, a distribution in which over 80% of respondents give a perfect rating is nowhere near normality. This degree of skew also raises questions about estimates of reliability based on correlational methods, such as coefficient alpha and factor analysis. For example, four studies in Table 1 report the coefficient alpha for different questionnaires, with item counts ranging from four to 34 (Duncan et al., 2012; Kim et al., 2001; Luborsky et al., 1996; Marziali et al., 1981). The estimates of reliability range from .88 to .94, and from .88 to .90 for sample sizes greater than 100. Normally, reliability is expected to vary at least to some degree with the number of items (Spearman, 1910). However, the Spearman–Brown formula assumes some degree of normality of distribution, and in these examples this expectation clearly fails, as higher item counts are not associated with greater reliability. Similarly, lack of normality affects not only correlations between alliance scores and outcome measure scores at any session, but also correlations between change in alliance score and outcome scores over time.
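For reference, the Spearman–Brown prediction formula gives the reliability expected when a test is lengthened by a factor of k (a standard result, stated here only to make the expectation in the preceding paragraph concrete):

\[ \rho_{k} \;=\; \frac{k\,\rho_{1}}{1 + (k-1)\,\rho_{1}} \]

Using the figures above as an illustration, if a four-item measure has a reliability of ρ₁ = .88, a comparable 34-item measure (k = 34/4 = 8.5) would be expected to have a reliability of roughly 8.5 × .88 / (1 + 7.5 × .88) ≈ .98, well above the .88–.94 range actually reported.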


1.2  |  Context

There is a dearth of research elucidating how ceiling effects impact the alliance–outcome association, and there is even less research doing so in naturalistic settings or comparing such effects across different presenting issues. As such, these articles report on novel data whose purpose is to explore the psychometric properties of two different sets of three alliance items that have been widely used with adults receiving general outpatient mental health services, those engaged in substance use treatment and those with serious and persistent mental health difficulties. The second article will explore two sets of three items used on questionnaires with youth and their parents/caregivers attending outpatient psychotherapy.

Youth treatment exists in a more heterogeneous environment, and as such, there may be differences in how alliance measures are rated and the resulting outcome association. The research objective guiding both studies is the following:

1. To investigate the variability of therapeutic alliance items and measures across adult and youth populations, and the outcomes associated with such variability.


2  | METHODS

This study used data extracted from the ACORN database, maintained by a small group of IT professionals, psychotherapy researchers, and training content developers associated with the Center for Clinical Informatics. The ACORN clinical information system enables immediate submission of various alliance and outcome questionnaires and immediately displays results to the therapist, along with algorithm-driven clinical messages summarising change over time and drawing attention to risk indicators, such as self-reported substance use, thoughts of self-harm and changes in the alliance scores.


2.1  |  Sample

At the time of this writing, the total ACORN database sample exceeded 1.2 million episodes of care. For these analyses, cases were selected based on the questionnaires completed, an intake score in the clinical range and at least two assessments. Adult outcome and alliance questionnaires were administered to those >18 years of age (N = 147,399). Youth outcome and alliance questionnaires were administered to those <18; no data are available to permit breaking out age categories within the youth group. Additionally, parents/carers completed (when available in the data) the same version of the alliance questionnaire as youths, and this is correlated with youth outcomes.

These data make no distinction between child and adolescent, so we refer to anyone <18 as youth. All clients are assigned an anonymous ID by the clinic so that client personal information is totally protected. Clinics participating in the ACORN collaboration understand that the collaboration analyses and shares results, both within the collaboration and in various research journals and professional publications. All clinic identifiers are likewise kept confidential, known only to the ACORN research staff. In some instances, the clinics have granted permission for specific users outside their organisation to view their data, as might be the case when there are payment incentives to collect outcome data and share results.


2.2  |  Measurement and statistical methodology

2.2.1  |  Outcome questionnaires

The ACORN database processes several versions of the outcome questionnaire, developed collaboratively with stakeholders across clinical sites through a partnership between academia, healthcare administrators and other key stakeholders (Brown et al., 2015; Lambert et al., 2009). Thus far, these collaborations have resulted in over 200 items with adequate psychometric properties.

Clinical sites can select the items that best reflect their client populations rather than being restricted to questionnaires that cannot be modified. All variations of the questionnaires are constructed to have a reliability of .9 or better for items loading on the general factor of global distress/well-being. A Global Distress Score (GDS) is calculated for each outcome measurement by taking the average of the item responses.


2.2.2  |  Severity-adjusted effect size

In order to accommodate the multiple versions of the youth and adult outcome questionnaires, the platform uses a metric for clients' pre-/post improvement, called the severity-adjusted effect size (SAES), that is statistically related to Hedges' g (Gaeta & Brydges, 2020). This is carried out by linearly adjusting the GDS of subsequent measurements based on the initial severity (i.e., the first GDS) and the diagnostic group when available.

The SAES is then calculated by adjusting the global average pre-/post-GDS change by the residual, that is, the difference between the linearly predicted change and the actual change.

Since the mean of the residualised change score is always 0, this results in an SAES with a mean equivalent to the global average pre-/post change but with a distribution of scores around this mean that already accounts for differences in case mix. SAES values for both the adult and youth measures are interpreted as follows: less than 0.5 is considered small; between 0.5 and 0.8 is medium; and greater than 0.8 is large.
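To make the residualisation step concrete, the following is a minimal Python sketch (not the authors' SAS implementation). The column names intake_gds and last_gds, the omission of diagnostic group, and the use of the intake-score standard deviation as the scaling denominator are illustrative assumptions; the article describes the metric only as statistically related to Hedges' g.

```python
# A minimal sketch of the residualised pre/post change described above.
import pandas as pd
from sklearn.linear_model import LinearRegression

def severity_adjusted_effect_size(df: pd.DataFrame) -> pd.Series:
    """Residualise pre/post GDS change on intake severity, re-centre on the
    global mean change, then scale to an effect-size-like metric."""
    actual_change = df["intake_gds"] - df["last_gds"]        # improvement = drop in distress
    model = LinearRegression().fit(df[["intake_gds"]], actual_change)
    predicted_change = model.predict(df[["intake_gds"]])
    residual = actual_change - predicted_change              # departure from case-mix expectation
    adjusted_change = actual_change.mean() + residual        # mean equals the global average change
    return adjusted_change / df["intake_gds"].std(ddof=1)    # scaling choice is an assumption
```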

This statistic is only calculated for cases with an intake score in the clinical range and at least one other assessment in that treatment episode and uses the intake score and diagnosis to adjust for differences in the case mix. This model accounts for an estimated 18% of the variance in change scores. Overall, in the ACORN data warehouse, 76% of episodes are in this range at intake. Removing variance due to differences in case mix increases the accuracy of estimates of clinician and clinic effects (Mahon et al., 2023). An episode is defined as sequential GDS questionnaires with no more than a 120-day gap between any two assessments. If a larger gap is detected, then a new episode is created on the system.
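As an illustration of the 120-day episode rule described above, the following hypothetical sketch segments sequential assessments into episodes; the column names client_id and assessment_date (a datetime column) are assumptions.

```python
import pandas as pd

def assign_episodes(df: pd.DataFrame, gap_days: int = 120) -> pd.DataFrame:
    df = df.sort_values(["client_id", "assessment_date"])
    # A gap of more than `gap_days` between consecutive assessments for the
    # same client starts a new episode.
    gap = df.groupby("client_id")["assessment_date"].diff() > pd.Timedelta(days=gap_days)
    df["episode"] = gap.groupby(df["client_id"]).cumsum()
    return df
```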



2.2.3  |  Statistical modelling

A general linear regression model (Jaccard & Bo, 2023), implemented in SAS as PROC GLM, was used to perform the case-mix adjustment so that the SAES accounts for intake score. The use of the intake score to predict the final score, or the pre-/post change score, is essential to control for regression artefacts (Campbell & Kenny, 1999). Diagnosis was available for less than half of the cases. Questionnaires for youth are scored and processed separately from those for adults; beyond this separation, age was not a significant predictor.



3  |  RESULTS

The following are the names and wording of the two sets of alliance items that appear on the adult questionnaires. All alliance item variable names start with SR, an abbreviation for Session Rating, consistent with common usage among other measures of the alliance, or session rating scales (Table 2).


Adult Alliance Questionnaire 1:

• SR1: I felt that we talked about the things that were important to me.

• SR3: I felt my counselor understood me.

• SR5: My therapy was helpful.


Adult Alliance Questionnaire 2:

• SR19: In this program, we discuss things that are really important to me.

• SR20: I feel understood and respected by the therapist.

• SR21: I understand and agree with my treatment plan.



3.1  |  Regression artefacts

The regression analyses revealed that between 44% and 59% of the variance in final GDS was explained by the intake score. Diagnosis or diagnostic groups, when available, explained less than 3% after accounting for intake score.

3.2  |  Item analyses

The first step of the analysis was to determine the distribution of item responses. All items used a scale of 0 to 4, with higher numbers indicating greater concerns about the treatment. The literature review identified a potential problem with client-completed alliance measures: their tendency to be heavily skewed in a positive direction, violating the assumption of a normal distribution. This presents challenges when trying to calculate parametric statistics, such as correlations between items and correlations with outcome. Correlations between items will appear magnified, while correlations with the GDS scale and with outcome (effect size) will be suppressed.

For this reason, the analyses also explored whether alliance is better treated as a categorical variable rather than as a continuous variable expressed as a scale score. Alliance scale scores, for the purposes of these analyses, are scored as the sum of the three items, resulting in a range of 0 to 12. The following tables present the distribution of item responses and alliance scores at the first assessment and the last assessment in the episode.

The alliance score (the sum of the three items) is divided by 3 so that it ranges from 0 to 4, as do the individual items, to be consistent with the single-item scores. For Alliance Questionnaire 1, over 80% of clients report alliance scores between 0 and 1 at the first assessment, with over 70% reporting scores of 0 (i.e., a 'perfect' alliance). This is evidence of the extremely high degree of skew apparent in these alliance scores. Alliance Questionnaire 2 has significantly less skew, with just over 40% reporting a total score of 0 at the first assessment (Table 3).

Even this level of skew, in which over 40% of responses to the items are 0, or as good as can be, makes it very difficult to justify a Pearson's r, as the distribution strongly violates the underlying assumptions of normality inherent in parametric statistical methods.

Any attempt to calculate patterns of association, regardless of the statistical method used (parametric or nonparametric), is still limited by the fact that often over 50% of the responses are 0; there is simply very little variance to work with. Table 4 presents results for questionnaires utilised in the treatment of substance use or of those with severe and persistent mental disorders. Again, the level of skew is quite apparent.

Even with increased variance in the first alliance score, the degree of correlation between items remains quite high due to the degree of skew.

To illustrate this point, reliability was estimated for each measure and population using records from the first assessment (where variability is higher). The Cronbach's alpha for both of these three-item measures is between .87 and .88 for all questionnaires in all populations. This is, of course, an abnormally high value for a three-item measure, as if one had asked the same question three times with only slightly different wording. It is a result of the skew in the responses, and it appears that the precise wording of the items makes little difference to the reliability. The problem of skew also dramatically reduces the ability to find correlations between alliance scores and changes in the GDSs.
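For readers who wish to check such estimates against their own data, the following is a minimal Python sketch of the standard Cronbach's alpha computation; the DataFrame layout (one column per alliance item, scored 0–4) is an assumption.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of items held in columns of a DataFrame."""
    k = items.shape[1]                               # number of items (3 for these measures)
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```

Applied to the item responses at the first assessment, this is the quantity reported above.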

Simple correlations and analysis of variance, using the first and last alliance scores to predict SAES, provide an estimate that alliance accounts for only slightly more than 1% of the variance in SAES. While this appears quite small, when applied at a population level, it is meaningful.

However, the use of simple correlations presents another problem. The correlation of alliance scores, or of changes in alliance scores, with SAES is not linear. Larger changes in alliance do not appear to be more predictive of SAES than very small changes. For this reason, we suggest treating change in alliance as a categorical variable with three categories: no change, alliance better and alliance worse.
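A minimal sketch of this recoding is given below (in Python, with hypothetical column names). Because higher item scores indicate greater concern about the treatment, a rise in the summed alliance score is coded as a worsening of the alliance.

```python
import numpy as np
import pandas as pd

def alliance_change_category(first_score: pd.Series, last_score: pd.Series) -> pd.Series:
    """Code any change in the summed alliance score, regardless of magnitude,
    as 'better', 'same' or 'worse'."""
    diff = last_score - first_score
    return pd.Series(
        np.select([diff < 0, diff > 0], ["better", "worse"], default="same"),
        index=first_score.index,
    )
```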

Treating alliance change as a categorical variable produced slightly smaller estimates of the percentage of variance in SAES explained, but it is perhaps easier to understand and to implement as part of decision-making. It also encourages clinicians to pay attention to very small indicators: an alliance score that changes from 0 to 1 (even on a 12-point scale) is associated with a significantly worse outcome.

Table 5 presents the SAES scores associated with these alliance questionnaires as a function of scores improving, staying the same, or getting worse. Note that over 63% remained unchanged, providing little information. Only 26% showed improved alliance, and less than 11%, worsening alliance.

Table 6 is for Alliance Questionnaire 2, which has more variability at intake, and therefore the potential to provide more information on change. Less than 42% remain unchanged, while over 33% improve and over 25% get worse.

Table 7 illustrates the results for the Substance Use Questionnaire containing Alliance 2 items. In this case, almost 63% of clients show no change in the Alliance scale. While the overall SAES is large, the 11.7% that reported a worse alliance showed dramatically less improvement.

The same three items again produced different results with the population experiencing severe and persistent mental health difficulties. This sample appears more likely to give feedback, with considerably less than 50% reporting no change in alliance; over 39% reported an improved alliance and almost 20% a worsened alliance. Again, any worsening of scores, no matter how small the change, is associated with a much smaller SAES (Table 8).

It is clear from these examples that even small changes in alliance are associated with different outcomes, particularly when the alliance worsens. Greater variance in alliance at the first assessment is associated with a greater likelihood of the alliance improving by the final assessment. However, while improved alliance is associated with marginally better outcomes, it is a worsened alliance that is the largest risk factor. Due to the nature of the distributions of the alliance items, correlational analyses using alliance scores to predict effect size explain only a small percentage of the variance in SAES.

Simple Pearson's r correlations between alliance scores and GDSs at any measurement point, while statistically significant (p< .001), remain low (r< .2). Correlations between alliance scores and SAES are likewise low (Pearson's r< .15).

Table 9 presents the percentage of variance explained (r²) using the first and last alliance scores, the alliance change scores, the first alliance score together with the change score, and finally, alliance change simply coded as better, same or worse.

It is apparent that the alliance scores in this sample, singularly or in combination, predict, at most, 2% of the variance in the measured outcome, and this varies depending on the selection of items and population measured.

4  | DISCUSSION

This study utilised the largest data set to date (N = 147,399) from naturalistic settings to examine the psychometric properties and treatment outcomes associated with two measures of the therapeutic alliance in four populations. The sample far exceeds that of the previous meta-analysis (N = 30,000) conducted by Flückiger et al. (2018).

Much of what we know about psychotherapy processes and outcomes is obtained from controlled studies and meta-analyses; however, naturalistic data, when available, often diverge from these findings and typically involve more heterogeneous populations and settings (Wuthrich et al., 2023). Such differences are found when comparing the effectiveness of empirically supported treatments in randomised controlled trials (RCTs) with the effectiveness of those treatments in real-world settings (Margison et al., 2000; Sakaluk et al., 2019; Wuthrich et al., 2023).

Similarly, therapist effects differ depending on whether the research is conducted in controlled trials or in naturalistic settings (Mahon et al., 2023; Wampold & Imel, 2015). One reason for this is the emphasis on internal validity in trials, which often comes at the cost of external validity.

The findings in this study revealed that the effectiveness of the therapy provided is comparable with that conducted in tightly controlled RCTs and meta-analyses (Wampold & Imel, 2015).

Yet, these large effect sizes were demonstrated across the four populations in fewer sessions than suggested by RCT dose–effect studies (Hansen et al., 2002). One reason for this is that RCTs tend to set out the number of sessions to be provided to participants as part of the study protocol and may not always report outcome data from early in treatment. The data in this study indicate that individuals complete therapy when they have achieved a large effect size, not necessarily after a fixed duration of therapy or number of sessions attended, as would be the norm in RCTs. Thus, our findings support previous research on the good-enough level model (Falkenström, Josefsson, et al., 2016; Reese et al., 2011).

Another factor to consider when evaluating naturalistic alliance data is the dynamic that occurs when a therapist administers an alliance measure at every session, as part of the therapy itself, versus how it may have been administered in a clinical trial. If clients approach the items with a wish to be polite or not to offend the therapist, there is a strong motivation to skew responses. Previous research has suggested that social desirability may moderate how alliance measures are scored (Paap et al., 2019), although other research did not find this (Reese et al., 2013). As such, the validity of a measure is limited by the therapist's ability to elicit direct and honest responses to the items. Clients' tendency to provide only small clues to any disturbance in the alliance could explain the results observed in the current data set.

The variance attributable to the alliance in this study (2% or less) is somewhat smaller than the 5%–8% found in previous meta-analyses (Flückiger et al., 2018; Horvath et al., 2011). The results are in the lower range of the studies included in those meta-analyses. Interestingly, Flückiger et al.'s (2018) meta-analysis reports that alliance at the start of therapy is the strongest predictor, whereas the consistent finding in the current analyses is that the final alliance score, or alliance change, accounts for far more of the limited variance explained than the first alliance score.

Another meta-analysis found that the timing of alliance was not a mediator of outcomes (Martin et al., 2000); however, again our findings stand in contrast. One possible reason for these differences is that Flückiger et al. (2018) limited their analysis to subjects with up to seven sessions, while these naturalistic samples were not restricted. Mahon et al. (2021a, 2021b) found that alliance scores were much more predictive of outcomes in therapy lasting five sessions or less than for therapy lasting longer.

Meta-analyses combine data from multiple studies with relatively small (<1000) sample sizes, making it difficult to draw broad generalisations that apply to data collected in naturalistic settings. Furthermore, it is rare for these controlled trial studies to include detailed information on the distribution of alliance scores. A simple report of coefficient alpha and the standard deviation is insufficient and misleading in the absence of an underlying normal distribution. It may be that some alliance measures can produce normal distributions when used in real-world settings, but this is not clear from the available evidence.

The three-item measures in this study were comparable psychometrically with alliance measures with many more items, while also being reflective of other research demonstrating how skewed measures of the alliance are (Kim et al., 2001; Saunders et al., 1989). Of course, this has implications for interpreting the meaning of alliance scores. With ceiling effects such as those found in these data, making sense of how items are rated and associated outcomes becomes problematic. What does seem to be important is any change in scores, regardless of the magnitude, and for this reason, we propose using a categorical approach of better, same or worse as one helpful method of interpreting data from questionnaires.

Our findings reveal that changes in alliance scores are predictive of outcome and that this is most apparent when the alliance is scored as worse: compared with those who rated the alliance as better, those who rated it as worse showed effect sizes up to 50% smaller in some samples. The alliance–outcome association tended to be similar across all populations, except for the two general outpatient questionnaires, in which the reduction in effect size associated with a worsened alliance was larger than in the other populations. Our data further suggest that the last alliance rating is the most important and most predictive element of the alliance–outcome correlation. As every session could, theoretically, be the client's last, attending to any ruptures in the alliance is essential.

Prior research from meta-analysis (Eubanks et al., 2018) demonstrates the effectiveness of repairing ruptures; as such, monitoring the alliance at every session and responding to the needs of clients through the renegotiation of goals and tasks or through metacommunication (Eubanks et al., 2018; Safran & Muran, 2000) are essential. Prior research argued that the therapeutic alliance is stable across the course of treatment (Crits-Christoph et al., 2011) and that timing of alliance is not a predictor of the alliance–outcome association (Martin et al., 2000).

In contrast, other studies have examined possible changes from session to session (Rubel et al., 2017; Zilcha-Mano et al., 2016). The data in our study suggest that this way of viewing the alliance is somewhat of a false dichotomy, as both positions are true at the same time. For example, the large percentage of those with no change in alliance would seem to indicate that the majority of those attending therapy are generally happy with their therapist and therapy experience and that the alliance is established from the first session. This may support the postulate of Meier and Feeley (2022), who argue that there may be a threshold beyond which clients experience the therapeutic relationship as established. However, these data are clear that for those who experience the alliance as worse, there is a significant drop in outcome; in this study, this represented 11%–25% of clients across the three populations.

4.1  |  Implications for practice, training and research

These results further confirm a broad association between alliance measures and measures of outcome. However, the percentage of variance explained is no more than 2% because of the extreme positive skew in alliance score distribution.

The practical implications of this skew for detecting a signal are potentially profound, as illustrated in these results. Very small changes in the score have significant clinical implications that will be missed entirely if the clinician is not paying close attention from one session to the next.

The alliance score may still appear very strong, but it is the small change that is predictive. If therapists and organisations use alliance measures as a tool for improving treatment outcomes, it is critical for clinicians to understand the nature of the measures and how best to employ them in the process of pursuing better outcomes for their clients. Meta-analyses illustrate that a significant percentage of adults (20%–47%) attending therapy drop out early (Swift & Greenberg, 2012; Wierzbicki & Pekarik, 1993) while approximately 5%–10% deteriorate during treatment (Hansen et al., 2002; Lambert, 2013).

There are various methods available to track the alliance and outcome of care on a sessional basis; notably, systems such as the one in this study, commonly referred to as routine outcome monitoring/feedback-informed care, offer therapists reliable measures with data-driven algorithms (Bovendeerd et al., 2022; Brown et al., 2021; de Jong et al., 2021; Delgadillo et al., 2017; Lambert et al., 2018). Where the alliance is scored as worse, the client is providing important information to the therapist, and using a measure will help the therapist to identify and then attend to the alliance ruptures using repair strategies. Regardless of the alliance measure and rating format used, we encourage therapists to use our categorical approach of better, same, or worse to help interpret alliance scores. Repair strategies should seek to identify and assess what component of the alliance has been ruptured; although the alliance is likely more than the sum of its parts, clients may be unhappy with certain aspects of the collaborative bond, goals, or tasks.

However, using measures of the alliance is only a starting point for addressing rupture–repair strategies. Therapists need to be able to actively engage clients in difficult conversations, and at times this may involve resolving conflict with angry clients. On this note, research suggests that therapists who display well-developed facilitative interpersonal skills when faced with challenging encounters have better outcomes and can maintain a more effective alliance (Anderson et al., 2016; Anderson & Perlman, 2022; Heinonen & Nissen-Lie, 2020; Schöttke et al., 2017; Wampold & Imel, 2015). Therapists who wish to improve these interpersonal skills for alliance maintenance can engage in deliberate practice (Chow et al., 2015; Goldberg et al., 2016; Mahon, 2022; Rousmaniere, 2016). Using client stimulus training videos to practise challenging encounters at the edge of one's ability, and refining responses based on feedback, is one deliberate practice method therapists will find helpful (Mahon, 2022).

These results also point to the importance of paying attention to the psychometric properties of individual items when researchers are developing alliance measures. The actual distribution of an item, as observed when it is used within the intended population, is as important as, or perhaps more important than, whether the item conforms to a pre-existing belief about which factors constitute the alliance. Small differences in the distribution of items can have a significant impact on an item's sensitivity to change over time. Future research with more naturalistic data is encouraged to replicate these findings. On this point, therapists, organisations and policymakers should be conscious that results from controlled trials may not reflect real-world therapy settings, and it may be better to rely on measurement-based care feedback systems to establish how effective their practice is.

4.2  |  Limitations

While a strength of this study is its sample size, the largest to date, and its naturalistic setting, the findings are not based on an experimental design, and as such, causality cannot be established. Data from naturalistic settings have limitations compared with data from well-controlled studies. There was likely variation in how measures were administered and interpreted across clinics, and clinicians may have influenced the test scores in unknown ways. Furthermore, the data in this sample are drawn from a variety of outpatient settings, and we are unable to determine whether there are diagnoses or other factors that could affect how effects in this data set are interpreted. In addition, there is known and unknown variability in how the alliance measure is administered, the instructions given to clients, clients' understanding of the reasons for the questionnaires, the nature of the treatment interventions, and differences in setting and population. Finally, the data in this study are taken from a large system that uses feedback to improve clinical outcomes and the therapeutic alliance. As such, the findings, while comparable with those from decades of meta-analyses on psychotherapy effectiveness, may not be reflective of other naturalistic settings not using such quality improvement processes.

5  |  CONCLUSION

The therapeutic alliance is an essential component of successful psychotherapy, although in this study it accounted for, at most, 2% of the variance in outcome. Measures of the alliance are highly skewed, and this must be considered when developing a questionnaire and when using alliance measures in clinical practice. The precise method used to score alliance questionnaires does not seem to be important, but any change in scores, even the smallest, is correlated with changes in outcome, and this is most magnified when the alliance is scored as worse by the client.

The magnitude of change in clinical outcome associated with alliance change is not linear, and as such, we propose a categorical method of interpreting scores as better, same or worse. Some populations seem to be more willing to provide feedback, and therapists need to be conscious of how to interpret this and of its implications for outcomes. Naturalistic data do not always reflect clinical trials, and this was apparent in the present study in the outcomes achieved in shorter time frames, in the difference in the percentage of variance attributable to the alliance and in the last alliance rating being the most predictive.

CONFLICT OF INTEREST STATEMENT

The authors Brown and Minami are associated with the consulting firm (Center for Clinical Informatics) that supports the ACORN collaboration. Both were involved in the formation of the collaboration in 2008.
