Causal Models

From Gaskination StatWiki
Revision as of 18:27, 4 December 2022 by Gaskination (talk | contribs) (Protected "Causal Models" ([Edit=Allow only administrators] (indefinite) [Move=Allow only administrators] (indefinite)) [cascading])


“Structural equation modeling (SEM) grows out of and serves purposes similar to multiple regression, but in a more powerful way which takes into account the modeling of interactions, nonlinearities, correlated independents, measurement error, correlated error terms, multiple latent independents each measured by multiple indicators, and one or more latent dependents also each with multiple indicators. SEM may be used as a more powerful alternative to multiple regression, path analysis, factor analysis, time series analysis, and analysis of covariance. That is, these procedures may be seen as special cases of SEM, or, to put it another way, SEM is an extension of the general linear model (GLM) of which multiple regression is a part.”

SEM is an umbrella concept for complex measurement models and structural models. A more common name for structural models is "causal models". This wiki page provides general instruction and guidance regarding how to write hypotheses for different types of causal model relationships, what to do with control variables, mediation, interaction, multi-group analyses, and model fit for causal models. Videos and slide presentations are provided in the subsections.

Do you know of some citations that could be used to support the topics and procedures discussed in this section? Please email them to me with the name of the section, procedure, or subsection that they support. Thanks!


Hypotheses are a keystone to causal theory. However, wording hypotheses is clearly a struggle for many researchers (just select at random any article from a good academic journal, and count the wording issues!). In this section I offer examples of how you might word different types of hypotheses. These examples are not exhaustive, but they are safe.

Direct effects

"Diet has a positive effect on weight loss"

"An increase in hours spent watching television will negatively affect weight loss"

Mediated effects

"Exercise mediates the positive relationship between diet and weight loss"

"Television time mediates the positive relationship between diet and weight loss"

"Diet affects weight loss indirectly through exercise"

Interaction effects

"Exercise strengthens the positive relationship between diet and weight loss"

"Exercise amplifies the positive relationship between diet and weight loss"

"TV time dampens the positive relationship between diet and weight loss"

Multi-group effects

"The relationship between X and Y is stronger for Group A."

"Body Mass Index (BMI) moderates the relationship between exercise and weight loss, such that for those with a low BMI, the effect is negative (i.e., you gain weight - muscle mass), and for those with a high BMI, the effect is positive (i.e., exercising leads to weight loss)"

"Age moderates the relationship between exercise and weight loss, such that for age < 40, the positive effect is stronger than for age > 40"

"Diet moderates the relationship between exercise and weight loss, such that for western diets the effect is positive and weak, and for eastern (Asian) diets the effect is positive and strong"

Mediated Moderation

An example of a mediated moderation hypothesis would be something like:

“Ethical concerns strengthen the negative indirect effect (through burnout) between customer rejection and job satisfaction.”

In this case, the IV is customer rejection, the DV is job satisfaction, burnout is the mediator, and the moderator is ethical concerns. The moderation is conducted through an interaction. However, if you have a categorical moderator, it would be something more like this (using gender as the moderator):

“The negative indirect effect between customer rejection and job satisfaction (through burnout) is stronger for men than for women.”

Handling controls

When including controls in hypotheses (yes, you should include them), simply add at the end of any hypothesis, "when controlling for...[list control variables here]". For example:

"Exercise positively moderates the positive relationship between diet and weight loss when controlling for TV time"

"Diet has a positive effect on weight loss when controlling for TV time"

Another approach is to state somewhere above your hypotheses (while you're setting up your theory) that all your hypotheses take into account the effects of the following controls: A, B, and C. And then make sure to explain why.

Logical Support for Hypotheses

Getting the wording right is only part of the battle, and is mostly useless if you cannot support your reasoning for WHY you think the relationships proposed in the hypotheses should exist. Simply saying X has a positive effect on Y is not sufficient to make a causal statement. You must then go on to explain the various reasons behind your hypothesized relationship. Take diet and weight loss for example. The hypothesis is, "Diet has a positive effect on weight loss". The supporting logic would then be something like:

  • Weight is gained as we consume calories. Diet reduces the number of calories consumed. Therefore, the more we diet, the more weight we should lose (or the less weight we should gain).

Statistical Support for Hypotheses through global and local tests

In order for a hypothesis to be supported, many criteria must be met. These criteria can be classified as global or local tests. In order for a hypothesis to be supported, the local test must be met, but in order for a local test to have meaning, all global tests must be met. Global tests of model fit are the first necessity. If a hypothesized relationship has a significant p-value, but the model has poor fit, we cannot have confidence in that p-value. Next is the global test of variance explained, or R-squared. We might observe significant p-values and good model fit, but if R-squared is only 0.025, then the relationships we are testing are not very meaningful because they do not explain sufficient variance in the dependent variable. The figure below illustrates the precedence of global and local tests. Lastly, and almost needless to explain, if a regression weight is significant, but is in the wrong direction, our hypothesis is not supported. Instead, there is counter-evidence. For example, if we theorized that exercise would increase weight loss, but instead, exercise decreased weight loss, then we would have counter-evidence.
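The precedence described above can be sketched as a small decision function. This is only an illustration: the class and function names are hypothetical, and the cutoffs (CFI of at least 0.95, RMSEA of at most 0.06, R-squared of at least 0.10, alpha of 0.05) are assumptions chosen for the example, not thresholds prescribed on this page.

```python
from dataclasses import dataclass


@dataclass
class PathResult:
    """Numbers you would read off your SEM output for one hypothesized path."""
    cfi: float        # global: model fit
    rmsea: float      # global: model fit
    r_squared: float  # global: variance explained in the DV
    estimate: float   # local: regression weight
    p_value: float    # local: significance of the regression weight


def evaluate_hypothesis(result, expected_sign, min_cfi=0.95, max_rmsea=0.06,
                        min_r2=0.10, alpha=0.05):
    """Check global tests first, then the local test (cutoffs are assumptions)."""
    if result.cfi < min_cfi or result.rmsea > max_rmsea:
        return "not supported: poor model fit"
    if result.r_squared < min_r2:
        return "not supported: too little variance explained"
    if result.p_value >= alpha:
        return "not supported: path not significant"
    # Significant path: the sign must match the hypothesized direction.
    if result.estimate * expected_sign < 0:
        return "counter-evidence: significant but in the wrong direction"
    return "supported"


good = PathResult(cfi=0.97, rmsea=0.04, r_squared=0.35, estimate=0.42, p_value=0.001)
print(evaluate_hypothesis(good, expected_sign=+1))
```

Passing `expected_sign=-1` for the same result returns the counter-evidence verdict, which mirrors the exercise and weight loss example above.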



Controls are potentially confounding variables that we need to account for, but that don’t drive our theory. For example, in Dietz and Gortmaker 1985, their theory was that TV time had a negative effect on school performance. But there are many things that could affect school performance, possibly even more than the amount of time spent in front of the TV. So, in order to account for these other potentially confounding variables, the authors control for them. They are basically saying that, regardless of IQ, time spent reading for pleasure, hours spent doing homework, or the amount of time parents spend reading to their child, an increase in TV time still significantly decreases school performance. These relationships are shown in the figure below.


As a cautionary note, you should nearly always include some controls; however, these control variables still count against your sample size calculations. So, the more controls you have, the higher your sample size needs to be. Also, you get a higher R-squared, but with increasingly smaller gains for each added control. Sometimes you may even find that adding a control “drowns out” all the effects of the IVs; in such a case you may need to run your tests without that control variable (but then you can only say that your IVs, though significant, only account for a small amount of the variance in the DV). With that in mind, you can’t and shouldn’t control for everything, and as always, your decision to include or exclude controls should be based on theory.

Handling controls in AMOS is easy, but messy (see the figure below). You simply treat them like the other exogenous variables (the ones that don’t have arrows going into them), and draw regression paths from them to whichever endogenous variables they may logically affect. In this case, I have valShort, a potentially confounding variable, as a control with regard to valLong. And I have LoyRepeat as a control on LoyLong. I’ve also covaried the controls with each other and with the other exogenous variables. When using controls in a moderated mediation analysis, go ahead and put the controls in at the very beginning. Covarying control variables with the other exogenous variables can be done based on theory, rather than as default. However, there are different schools of thought on this. The downside of covarying with all exogenous variables is that you gain no degrees of freedom. If you are in need of degrees of freedom, then try removing the non-significant covariances with controls.


When reporting the model, you do need to include the controls in all your tests and output, but you should consolidate them at the bottom where they can be out of the way. Also, just so you don’t get any crazy ideas, you would not test for any mediation between a control and a dependent variable. However, you may report how the control affects a dependent variable differently based on a moderating variable. For example, valShort may have a stronger effect on valLong for males than for females. This is something that should be reported, but not necessarily focused on, as it is not likely a key part of your theory. Lastly, even if effects from controls are not significant, you do not need to trim them from your model (although there are also other schools of thought on this issue).



Mediation models are used to describe chains of causation. Mediation is often used to provide a more accurate explanation for the causal effect the antecedent has on the dependent variable. The mediator is usually the variable that is the missing link in a chain of causation. For example, intelligence leads to increased performance - but not in all cases, as not all intelligent people are high performers. Thus, some other variable is needed to explain the reason for the inconsistent relationship between IV and DV. This other variable is called a mediator. In this example, work effectiveness may be a good mediator. We would say that work effectiveness mediates the relationship between intelligence and performance. Thus, the direct relationship between intelligence and performance is better explained through the mediator of work effectiveness. The logic is that intelligent workers tend to perform better because they work more effectively. Thus, when intelligence leads to working smarter, we observe greater performance.


We used to theorize three main types of mediation based on the Baron and Kenny approach; namely: 1) partial, 2) full, and 3) indirect. However, recent literature suggests that mediation is less nuanced than this -- that simply, if a significant indirect effect exists, then mediation is present.
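To make "a significant indirect effect" concrete, here is a minimal sketch that multiplies the two paths (IV to mediator, mediator to DV) and tests the product with the Sobel approximation. Note that the Sobel test is swapped in here for brevity; it is not a method named on this page, and bootstrapped confidence intervals are often preferred in practice. The function name and the example coefficients are made up for illustration.

```python
import math


def sobel_test(a, se_a, b, se_b):
    """Indirect effect a*b with a Sobel z statistic and two-tailed p-value.

    a, se_a: unstandardized path IV -> mediator and its standard error
    b, se_b: unstandardized path mediator -> DV (controlling for the IV)
    """
    indirect = a * b
    se_indirect = math.sqrt(b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2)
    z = indirect / se_indirect
    # Two-tailed p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return indirect, z, p_value


# Hypothetical estimates: intelligence -> work effectiveness -> performance.
indirect, z, p = sobel_test(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
print(f"indirect={indirect:.3f}, z={z:.2f}, p={p:.4f}")
```

If the p-value falls below your alpha, the indirect effect is significant, and by the logic above, mediation is present.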

Here is another useful site for mediation:



In factorial designs, interaction effects are the joint effects of two predictor variables in addition to the individual main effects. This is another form of moderation (along with multi-grouping) – i.e., the X to Y relationship changes form (gets stronger, weaker, changes signs) depending on the value of another explanatory variable (the moderator). So, for example:

  • you lose 1 pound of weight for every hour you exercise
  • you lose 1 pound of weight for every 500 calories you cut back from your regular diet
  • but when you exercise while dieting, you lose 2 pounds for every 500 calories you cut back from your regular diet, in addition to the 1 pound you lose for exercising for one hour; thus in total, you lose three pounds

So, the multiplicative effect of exercising while dieting is greater than the additive effects of doing one or the other. Here is another simple example:

  • Chocolate is yummy
  • Cheese is yummy
  • but combining chocolate and cheese is yucky!
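The exercise and diet numbers above can be written as a regression with a product term. The sketch below is illustrative only: the function name is hypothetical, diet is measured in units of 500 calories cut, and the interaction weight of 1 is inferred from the example (it is the extra pound that appears only when both are present).

```python
def predicted_weight_loss(exercise_hours, diet_units,
                          b_exercise=1.0, b_diet=1.0, b_interaction=1.0):
    """Pounds lost = main effects + interaction (product) term.

    diet_units counts 500-calorie cuts from the regular diet.
    The default weights are inferred from the worked example, not estimated.
    """
    main_effects = b_exercise * exercise_hours + b_diet * diet_units
    interaction = b_interaction * exercise_hours * diet_units
    return main_effects + interaction


print(predicted_weight_loss(1, 0))  # exercise only
print(predicted_weight_loss(0, 1))  # diet only
print(predicted_weight_loss(1, 1))  # both: more than the sum of the two
```

With one hour of exercise, the slope for diet becomes `b_diet + b_interaction` = 2 pounds per 500 calories, which is exactly the "2 pounds" in the example.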

The following figure is an example of a simple interaction model.



Interactions enable more precise explanation of causal effects by providing a method for explaining not only how X affects Y, but also under what circumstances the effect of X changes depending on the moderating variable of Z. Interpreting interactions is somewhat tricky. Interactions should be plotted (as demonstrated in the tutorial video). Once plotted, the interpretation can be made using the following four examples (in the figures below) as a guide. My most recent Stats Tools Package provides these interpretations automatically.
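As a rough aid to interpretation, the slope of X on Y at a given value of the moderator Z can be computed directly from the regression weights. This is a sketch under assumptions: X and Z are standardized (so "low" and "high" Z are taken as -1 and +1), the function names are hypothetical, and the wording rule (same sign means "strengthens", opposite sign means "dampens") is a common reading that matches the hypothesis wording earlier on this page, not output copied from the Stats Tools Package.

```python
def simple_slope(b_x, b_xz, z_value):
    """Effect of X on Y when the moderator Z equals z_value."""
    return b_x + b_xz * z_value


def describe_moderation(b_x, b_xz):
    """Plain-language reading of an interaction (assumes b_x is nonzero)."""
    if b_x == 0:
        raise ValueError("b_x must be nonzero to describe a direction")
    if b_xz == 0:
        return "Z does not moderate the relationship between X and Y"
    direction = "positive" if b_x > 0 else "negative"
    verb = "strengthens" if b_x * b_xz > 0 else "dampens"
    return f"Z {verb} the {direction} relationship between X and Y"


# Hypothetical standardized weights for X and the X*Z product term.
b_x, b_xz = 0.4, 0.2
print("slope at low Z: ", simple_slope(b_x, b_xz, -1))
print("slope at high Z:", simple_slope(b_x, b_xz, +1))
print(describe_moderation(b_x, b_xz))
```

The two slopes are the lines you would draw when plotting the interaction: one for low Z and one for high Z.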


Model fit again

You already did model fit in your CFA, but you need to do it again in your structural model in order to demonstrate sufficient exploration of alternative models. Every time the model changes and a hypothesis is tested, model fit must be assessed. If multiple hypotheses are tested on the same model, model fit will not change, so it only needs to be addressed once for that set of hypotheses. The method for assessing model fit in a causal model is the same as for a measurement model: look at modification indices, residuals, and standard fit measures like CFI, RMSEA, etc. The one thing that should be noted here in particular, however, is the logic that should determine how you apply the modification indices to error terms. Also, be warned that some argue there is never an appropriate argument for covarying error terms. (I tend to agree that they should not be covaried.)

  • If the correlated variables are not logically causally correlated, but merely statistically correlated, then you may covary the error terms in order to account for the systematic statistical correlations without implying a causal relationship.
    • e.g., burnout from customers is highly correlated with burnout from management
    • We expect these to have similar values (residuals) because they are logically similar and have similar wording in our survey, but they do not necessarily have any causal ties.
  • If the correlated variables are logically causally correlated, then simply add a regression line.
    • e.g., burnout from customers is highly correlated with satisfaction with customers
    • We expect burnC to predict satC, so not accounting for it is negligent.

Lastly, remember, you don't need to create the BEST fit, just good fit. If a BEST fit model (i.e., one in which all modification indices are addressed) isn't logical, or does not fit with your theory, you may need to simply settle for a model that has worse (yet sufficient) fit, and then explain why you did not choose the better fitting model. For more information on when it is okay to covary error terms (because there are other appropriate reasons), refer to David Kenny's thoughts on the matter: David's website


Multi-group comparisons are a special form of moderation in which a dataset is split along values of a grouping variable (such as gender), and then a given model is tested with each set of data. Using the gender example, the model is tested for males and females separately. The use of multi-group comparisons is to determine if relationships hypothesized in a model will differ based on the value of the moderator (e.g., gender). Take the diet and weight loss hypothesis for example. A multi-group analysis would answer the question: does dieting affect weight loss differently for males than for females? In the videos above, you will learn how to set up a multigroup analysis in AMOS, and test it using chi-square differences and AMOS's built-in multigroup function. For those who have seen my video on the critical ratios approach, be warned that currently, the chi-square approach is the most widely accepted because the critical ratios approach doesn't take into account family-wise error, which affects a model when testing multiple hypotheses simultaneously. For now, I recommend using the chi-square approach. The AMOS built-in multigroup function uses the chi-square approach as well.
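For readers who want to check a chi-square difference by hand, here is a minimal sketch. It assumes you have already estimated two nested models: an unconstrained model (paths free to differ across groups) and a constrained model (the paths of interest forced equal across groups). The function names and the example fit values are hypothetical; the p-value is computed with the standard library so the example has no dependencies.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square statistic (integer df >= 1)."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then step up two df at a time.
    if df % 2 == 0:
        nu, q = 2, math.exp(-half)
    else:
        nu, q = 1, math.erfc(math.sqrt(half))
    while nu < df:
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q


def chi_square_difference(chi2_constrained, df_constrained,
                          chi2_free, df_free, alpha=0.05):
    """Compare a constrained model against the freely estimated model."""
    delta_chi2 = chi2_constrained - chi2_free
    delta_df = df_constrained - df_free
    if delta_df < 1:
        raise ValueError("the constrained model must have more df")
    p_value = chi2_sf(delta_chi2, delta_df)
    return delta_chi2, delta_df, p_value, p_value < alpha


# Hypothetical fit results from a two-group (male/female) model.
d_chi2, d_df, p, groups_differ = chi_square_difference(110.5, 50, 100.2, 48)
print(f"delta chi2={d_chi2:.1f}, delta df={d_df}, p={p:.4f}, differ={groups_differ}")
```

A significant difference means that forcing the groups to be equal makes the model fit worse, i.e., the groups differ on the constrained paths.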

MGA for more than two groups

No SEM software currently allows the comparison of more than two groups at a time. The primary reason for this is the human limitation of only being able to compare pairs. First-gen methods of comparison allow pairwise comparisons (i.e., each group compared to every other group in pairs), such as Tukey's or Bonferroni's post-hoc tests in an ANOVA. So, if you have more than two groups, you'll need to compare them in pairs as well. Let's take the example of marital status (simplified here for ease of teaching): single, married, divorced. You could compare these in pairs:

  • Single vs Married
  • Single vs Divorced
  • Married vs Divorced

To do this in AMOS, SmartPLS, or Mplus, you would just specify groups by the value of the marital status variable (e.g., single=0, married=1, divorced=2). So, if you were comparing Single vs Married, you would set group 1 to use only data from your dataset where the marital status variable = 0, and group 2 would use only data where marital status = 1.
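If you prepare the group datasets outside of your SEM software, the pairwise split can be sketched as follows. The column name `marital_status`, the function name, and the toy rows are assumptions for illustration; the coding (single=0, married=1, divorced=2) follows the example above.

```python
from itertools import combinations

MARITAL_CODES = {"single": 0, "married": 1, "divorced": 2}


def pairwise_group_datasets(rows, group_key="marital_status", codes=MARITAL_CODES):
    """Build (group 1, group 2) data subsets for every pair of groups."""
    comparisons = {}
    for (name_a, code_a), (name_b, code_b) in combinations(codes.items(), 2):
        group_1 = [row for row in rows if row[group_key] == code_a]
        group_2 = [row for row in rows if row[group_key] == code_b]
        comparisons[f"{name_a} vs {name_b}"] = (group_1, group_2)
    return comparisons


# Toy dataset: each row is one respondent.
data = [
    {"id": 1, "marital_status": 0},
    {"id": 2, "marital_status": 1},
    {"id": 3, "marital_status": 2},
    {"id": 4, "marital_status": 1},
]
for label, (g1, g2) in pairwise_group_datasets(data).items():
    print(label, len(g1), len(g2))
```

Each pair of subsets would then be loaded as group 1 and group 2 in a separate two-group analysis.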

Another approach is to compare groups as one vs all other:

  • Single vs not Single
  • Married vs not Married
  • Divorced vs not Divorced

To do this, you would need to create three new dummy variables in your dataset to represent whether someone was single, married, or divorced. In all cases, the row would receive a value of 1 for that variable if the condition was met (i.e., if single, then the dummy variable for single would get a 1, but the dummies for married and divorced would get a zero). Then, in your MGA, if you were testing single vs not single, you would specify group 1 as using data only if the single dummy = 0 (not single) and group 2 as using data only if the single dummy = 1 (single).
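The dummy-variable setup for one-vs-rest comparisons might look like this in code. Again, this is only a sketch: the column names (`marital_status`, `is_single`, and so on), the function names, and the toy rows are hypothetical.

```python
MARITAL_CODES = {"single": 0, "married": 1, "divorced": 2}


def add_dummies(rows, group_key="marital_status", codes=MARITAL_CODES):
    """Return copies of the rows with one 0/1 dummy column per group."""
    coded = []
    for row in rows:
        new_row = dict(row)
        for name, code in codes.items():
            new_row[f"is_{name}"] = 1 if row[group_key] == code else 0
        coded.append(new_row)
    return coded


def one_vs_rest(rows, dummy):
    """Group 1 = dummy is 0 (e.g., not single); group 2 = dummy is 1 (single)."""
    group_1 = [row for row in rows if row[dummy] == 0]
    group_2 = [row for row in rows if row[dummy] == 1]
    return group_1, group_2


data = [
    {"id": 1, "marital_status": 0},
    {"id": 2, "marital_status": 1},
    {"id": 3, "marital_status": 2},
]
coded = add_dummies(data)
not_single, single = one_vs_rest(coded, "is_single")
print(len(not_single), len(single))
```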

From Measurement Model to Structural Model

Many of the examples in the videos so far have taught concepts using a set of composite variables (instead of latent factors with observed items). Many will want to utilize the full power of SEM by building true structural models (with latent factors). This is not a difficult thing. Simply remove the covariance arrows from your measurement model (after CFA), then draw single-headed arrows from IVs to DVs. Make sure you put error terms on the DVs, then run it. It's that easy. Refer to the video for a demonstration.

Creating Factor Scores from Latent Factors

If you would like to create factor scores (as used in many of the videos) from latent factors, it is an easy thing to do. However, you must remember two very important caveats:

  • You are not allowed to have any missing values in the data used. These will need to be imputed beforehand in SPSS or Excel (I have two tools for this in my Stats Tools Package - one for imputing, and one for simply removing the entire row that has missing data).
  • Latent factor names must not have any spaces or hard returns in them. They must be single continuous strings ("FactorOne" or "Factor_One" instead of "Factor One").

After those two caveats are addressed, then you can simply go to the Analyze menu, and select Data Imputation. Select Regression Imputation, and then click on the Impute button. This will create a new SPSS dataset with the same name as the current dataset except it will be followed by an "_C". This can be found in the same folder as your current dataset.
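Before running the imputation, it can help to check both caveats programmatically. The sketch below is an assumption-laden illustration, not part of AMOS or the Stats Tools Package: it treats `None`, empty strings, and NaN as missing, and flags any factor name containing whitespace (spaces or hard returns).

```python
import math
import re


def find_missing(rows):
    """Return (row_index, column) pairs whose value is None, empty, or NaN."""
    problems = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is None or value == "" or is_nan:
                problems.append((index, column))
    return problems


def invalid_factor_names(names):
    """Return factor names that contain spaces or line breaks."""
    return [name for name in names if re.search(r"\s", name)]


rows = [{"q1": 4, "q2": None}, {"q1": float("nan"), "q2": 5}]
print(find_missing(rows))
print(invalid_factor_names(["FactorOne", "Factor One", "Factor_One"]))
```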

Need more degrees of freedom

Did you run your model and observe that DF=0 or CFI=1.000? Sounds like you need more degrees of freedom. There are a few ways to do this:

  1. If there are opportunities to use latent variables instead of computed variables, use latents.
  2. If you have control variables, do not link them to every other variable.
  3. Do not include all paths by default. Just include the ones that make good theoretical sense.
  4. If a path is not significant, omit it. If you do this, make sure to argue that the reason for doing this was to increase degrees of freedom (and also because the path was not significant).

Increasing the degrees of freedom allows AMOS to calculate model fit measures. If you have zero degrees of freedom, model fit is irrelevant because you are "perfectly" accounting for all possible relationships in the model.
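To see where DF=0 comes from, you can count degrees of freedom by hand: the number of unique variances and covariances among the observed variables, minus the number of freely estimated parameters. The sketch below assumes a covariance-only model (no means or intercepts), and the function name is hypothetical.

```python
def degrees_of_freedom(n_observed, n_free_parameters):
    """df = unique variances/covariances minus freely estimated parameters.

    Assumes only the covariance structure is modeled (no means/intercepts).
    """
    unique_moments = n_observed * (n_observed + 1) // 2
    return unique_moments - n_free_parameters


# Three observed variables X, M, Y give 3 * 4 / 2 = 6 unique moments.
# All paths (X->M, M->Y, X->Y) + var(X) + 2 error variances = 6 parameters.
print(degrees_of_freedom(3, 6))  # saturated model
# Dropping the X->Y path frees one degree of freedom.
print(degrees_of_freedom(3, 5))
```

This is why omitting paths (items 3 and 4 above) or not linking controls to everything (item 2) increases degrees of freedom: each omitted path or covariance is one fewer parameter to estimate.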

Sample Size

Sample size is an important consideration for statistical power. Statistical power is the probability that your statistical model is "powerful" enough to detect significant effects that actually do exist. In other words, can your model prevent false negatives (erroneously failing to reject the null hypothesis)? Because critical values (and therefore p-values) are calculated based on minimization of error, we must have at least a minimum satisfactory level of sample size to keep error below a maximum acceptable level. If error is beyond that maximum threshold, then we will be more likely to find false negatives because our confidence intervals will be too wide and consequently will be more likely to span zero - thereby leading to a failure to reject the null. The way to keep error below that maximum threshold is to ensure we have an adequate sample size to start with. There are many published recommendations on a priori sample size calculation. All of them are just estimations. What it really boils down to is the minimization of error. So, if your sample population is a good set of respondents, with thoughtful and consistent responses to your survey (assuming you are using a survey), then there will be less error than if you are sampling on a service like mTurk.

In my experience, having designed and advised on several dozen quantitative projects, the following sample size formula has worked pretty well: 50+5x. In this formula, x represents the number of observed variables (e.g., survey questions) you are measuring. So, if your survey contains 65 questions, then you would need 50+65*5=375 responses. This would be sufficient to confidently run any SEM-related analyses. If you have a good and reliable set of respondents, then you may be able to get away with an even lower threshold, such as 50+3x. However, it is often very difficult to collect additional data after you have already finished one round of data collection. So, get enough the first time around. If you want to be safe, aim for 50+8x.
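The rule of thumb above is easy to turn into a small helper. The function name is hypothetical; the multipliers 3, 5, and 8 are the ones given in the text.

```python
def minimum_sample_size(n_observed, per_variable=5, base=50):
    """Rule-of-thumb sample size: base + per_variable * observed variables."""
    return base + per_variable * n_observed


questions = 65
print(minimum_sample_size(questions))                  # 50 + 5x
print(minimum_sample_size(questions, per_variable=3))  # lenient: 50 + 3x
print(minimum_sample_size(questions, per_variable=8))  # safe: 50 + 8x
```

For the 65-question survey, this gives 375 responses with the default rule, matching the calculation above.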