Guideline
How the guideline works
The guideline aims to help you pick the appropriate scale, evaluate whether it was developed adequately, and determine its validity as a measure. It includes 13 high-level questions that can help you determine whether a candidate scale will be adequate for your research. These guideline questions should be applied to the scale development or validation paper. If the scale does not meet the acceptable criteria (as outlined below or in the paper), then it does not get the point for that guideline question. The maximum number of points a scale can achieve is 13/13 or 100%. Below, we will provide a very high-level review of the information that is contained in the guideline.
The guideline questions are grouped into the three stages of the scale development and validation process: item development, scale development, and scale evaluation.
Stage 1: Item development
Stage summary: Item development refers to the process by which the items of the scale are created. Each item is intended to capture the construct of interest either in part or in full. There are three main questions the reader can ask of the scale at this stage:
- Is the construct that the scale is attempting to measure defined clearly somewhere in the paper? A clear and precise definition of the construct is critical. This can be achieved using a theory-driven or a data-driven approach, depending on whether an agreed-upon theoretical framework of the construct exists. The reader should ensure that the reported definition of the construct being measured by the scale matches the construct of interest for their research project.
- Is the item generation process discussed (e.g., via a literature review, the Delphi method, or crowd-sourcing)? Look in the scale development paper for any information about how the items were generated. Typically, authors generate items via a literature review (i.e., by reviewing existing scales and the theoretical and empirical literature), but this can also be done using the Delphi method, an iterative process that requires an expert panel to evaluate items. Lastly, another method that is common in HRI is what we call “crowd-sourcing,” which refers to a broad variety of methods where lay persons are recruited to consider their interpretation of the construct and provide their thoughts about items and stimuli.
- Does the final version of the items capture the construct as it has been defined by the authors? The first step is to ensure that the items are listed verbatim. If they are not, it is impossible to determine whether the items are appropriate for your research. If the items are listed verbatim, refer back to the stated definition of the construct to ensure that the items are related to this definition. Specifically, ensure that no additional constructs are captured in the final version of the scale.
Stage 2: Scale development
Stage summary: Though there are many different methods for developing a scale (e.g., classical test theory or item response theory), there are some components of the process that are consistent across methods. This section of the guideline includes seven questions.
- Did the scale developers report the full initial set of items? Ensure that the developers of the scale made the full initial set of items publicly available, either by reporting them in the main text of the paper, in an appendix, or in an online repository. This is important because it will allow the reader not only to determine whether the items capture the construct (i.e., guideline question three) but also whether the sample size is large enough to determine a factor structure (i.e., guideline question five). Additionally, the reader can determine whether the factor loadings for all the items meet the relevant stated criteria (i.e., guideline question eight) and whether the scale developers removed items appropriately (i.e., guideline question nine).
- Does the test sample size meet the 10:1 minimum criterion? Sample sizes for scale development studies should follow the 10:1 (participants to initial number of items) rule, though more participants are considered a positive feature. This rule pertains to the initial set of items, not the final version of the scale.
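The 10:1 rule can be checked with simple arithmetic. Below is a minimal illustrative helper; the function name and the configurable ratio parameter are ours, not part of the guideline itself.

```python
# Illustrative check of the 10:1 participants-to-items rule.
# Note: the rule applies to the *initial* item pool, not the final scale.

def meets_sample_size_rule(n_participants: int, n_initial_items: int,
                           ratio: float = 10.0) -> bool:
    """Return True if the development sample meets the ratio rule."""
    return n_participants >= ratio * n_initial_items

# Example: a 40-item initial pool requires at least 400 participants,
# so a sample of 450 passes and a sample of 250 does not.
print(meets_sample_size_rule(450, 40))
print(meets_sample_size_rule(250, 40))
```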
- Did the scale developers perform an EFA, PCA, Rasch analysis, or similar test to determine the item-to-factor relationship? There are many methods that can be used to determine the underlying factor structure of the construct of interest. The reader should determine whether the scale developers report using at least one scale development method (such as EFA, PCA, or Rasch analysis) in their paper.
- Did the scale developers describe how they determined the number of factors? Constructs can be unidimensional (consisting of one factor) or multidimensional (consisting of more than one factor). You should verify that the scale developers reported how they determined or verified the number of factors that exist within the construct.
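One heuristic that developers commonly report for this decision is the Kaiser criterion (retain factors whose eigenvalues exceed 1), alongside scree plots and parallel analysis. The sketch below, using only NumPy, is an illustration of that one heuristic, not the full procedure a development paper should report.

```python
import numpy as np

def kaiser_n_factors(responses: np.ndarray) -> int:
    """Kaiser criterion sketch: count eigenvalues of the item
    correlation matrix that are greater than 1.

    responses: participants x items matrix of complete responses.
    """
    corr = np.corrcoef(responses, rowvar=False)  # items as columns
    eigenvalues = np.linalg.eigvalsh(corr)
    return int(np.sum(eigenvalues > 1.0))
```

In practice, parallel analysis (comparing eigenvalues to those from random data) is usually preferred over the raw eigenvalue-greater-than-1 rule; the point here is only that some explicit, reported criterion should exist.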
- Did the scale developers provide factor loadings (EFA/CFA) or item fits (Rasch) for all items? You should look for quantitative values that indicate how the items in the scale relate to the construct of interest. These values can be in the form of factor loadings (if the scale development process used an EFA or CFA) or infit/outfit values (if Rasch analysis was used).
- Is there a description of the item removal process (e.g., using infit/outfit, factor loading minimum values, or cross-loading values)? It is very likely that the initial set of items, in its entirety, will either not be appropriate for the construct or will not capture the full scope of the construct. Having a principled way of removing items that do not fit within the construct is necessary, as is the detailed reporting of that procedure. Item reduction can be done in a number of ways depending on the scale development method. If items were removed, the reason (e.g., lack of fit or redundancy) should be explicitly mentioned; quantitative criteria should also be reported when possible.
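To make "quantitative criteria" concrete, here is a hypothetical loading-based removal rule. The cutoffs (primary loading at least 0.40, and a gap of at least 0.20 between the primary and the strongest cross-loading) are common heuristics we have chosen for illustration, not values prescribed by the guideline.

```python
import numpy as np

def items_to_remove(loadings: np.ndarray, min_loading: float = 0.40,
                    min_gap: float = 0.20) -> list[int]:
    """Flag items whose strongest loading is weak, or that cross-load
    almost equally on two factors.

    loadings: items x factors matrix of factor loadings.
    """
    flagged = []
    for i, row in enumerate(np.abs(loadings)):
        sorted_row = np.sort(row)[::-1]          # strongest first
        primary = sorted_row[0]
        secondary = sorted_row[1] if len(sorted_row) > 1 else 0.0
        if primary < min_loading or (primary - secondary) < min_gap:
            flagged.append(i)
    return flagged

# Example: item 1 loads weakly everywhere; item 2 cross-loads.
loadings = np.array([[0.70, 0.10],
                     [0.30, 0.20],
                     [0.50, 0.45]])
print(items_to_remove(loadings))
```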
- Did the scale developers report the complete list of items included in the final version of the scale? It is critical that the final version of the scale in the publication is clearly reported to ensure that the scale is used as intended. The reader should look for this information either in the main text of the publication, in an appendix, or in an online repository.
Stage 3: Scale evaluation
Stage summary: Scale evaluation occurs after the original scale is created and attempts to answer the following three questions.
- Did the scale developers include a factor structure test (e.g., a second EFA, CFA, DIF, a test of unidimensionality if using Rasch, or similar)? After a scale has been created, it is best to determine whether the scale has the same factor structure on a different sample. Check to see if there is a test of factor structure. A confirmatory factor analysis (CFA) or a test of Differential Item Functioning (DIF) are common approaches.
- Was a measure of reliability (e.g., Cronbach’s α, McDonald’s ωt or ωh, or Tarkkonen’s rho) reported? Reliability refers to the principle that a measurement produces similar results under similar conditions and is related to one of the core components of science: replicability. There should be some test of the scale’s reliability in the paper. This can be completed using metrics such as McDonald’s ωt or ωh (omega hierarchical) in addition to Cronbach’s coefficient α. A reasonable minimum threshold for reliability is ≥ 0.80.
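Cronbach's α has a simple closed form, so a minimal sketch may help readers sanity-check reported values. This version assumes a complete (no missing data) participants-by-items response matrix.

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Cronbach's coefficient alpha:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))

    responses: participants x items matrix with no missing values.
    """
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```

Compare the result against the ≥ 0.80 heuristic above, keeping in mind that α depends on the number of items as well as their intercorrelations, which is one reason the ω family is often preferred.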
- Was a test of validity (e.g., predictive, concurrent, convergent, discriminant) reported? Validity measures the extent to which the scale actually measures the latent dimension it was developed to evaluate and is a fundamental concept in psychological measurement. Look for comparisons of the scale of interest to others in the field and see whether any relationships exist. If there is a strong relationship between scales measuring distinct constructs or factors, then more work needs to be done before the scale can be used. If no report of validity has been conducted, then you should explicitly report this limitation when publishing or presenting results using that scale. If you have the resources, we encourage you to conduct a validation study and publish the results!
Note: the guideline includes recommendations for minimum acceptable criteria. Where possible, we provide citations for recommendations with exact values. These values can and should be interpreted as heuristics. We do not encourage you to discard a scale simply because it does not meet a specific threshold that is suggested here.