The pre-conference workshops are small group sessions that take place the day before the Conference begins (Friday, July 1).  Each workshop is a half-day, providing the opportunity to attend one or two sessions.  Registration is limited, and the cost, including lunch for all workshop participants, is $200 for one workshop (AM or PM) or $350 for two workshops (AM and PM).

Add a Workshop to your Registration

If you would like to add a workshop to your existing registration, simply:

DEADLINE:  Ticket sales end at 5:00 pm (Pacific time) on Wednesday, June 29

ITC 2016 Workshops

Each workshop is offered in the morning and again in the afternoon session.


Alina Von Davier


In this workshop I will introduce the basic concepts of computational psychometrics (CP; von Davier, 2015), focusing on data mining, machine learning, and data visualization with applications in assessment. CP merges data-driven approaches with theoretical (cognitive) models to provide a rigorous framework for the measurement of skills in the presence of big data. I will discuss five types of big data in educational assessment: a) ancillary information about the test takers; b) process data from simulations and games; c) data from collaborative interactions; d) data from multimodal sensors; and e) large data sets from tests with continuous administrations over time.

Learning technology offers rich data for developing novel educational assessments. The data obtained from learning and assessment technologies are typically “complex,” in the sense that they involve sources of statistical dependence that violate the assumptions of conventional psychometric models. These data are also referred to as process data. Complex assessment data have also been hypothesized to involve a variety of “non-cognitive” factors such as motivation and social skills, thereby increasing the dimensionality of the measurement problem and the potential for measurement bias. The question is how one can measure, predict, and classify test takers’ skills in a simulation or game-based assessment. Data mining (DM) and machine learning (ML) tools, merged with theoretical (cognitive) models, may help answer this question.

As mentioned before, other types of assessment data may also benefit from data mining techniques. One example is data that consist of test scores and background questionnaire responses collected over many administrations of a test with an almost continuous administration mode. In this situation, one research question may concern patterns in the data; another may be whether test scores for specific subpopulations can be predicted as part of quality control efforts [4].

I will discuss similarities and differences between machine learning and statistical inference [5] and will present several free software packages available for machine learning and visualization [6], such as WEKA, RapidMiner, MIRAGE, and routines in R. I will also present examples of research projects and applications in assessment.


Boaz Shulruf


Item Response Theory (IRT) and related models commonly utilise the information in tests to estimate person and item parameters for improving test and item quality, yet they have seen limited application to standard setting. The majority of the currently used standard setting methods rely upon judges to make decisions on the expected performance of a minimally competent examinee in a given test. Some methods (e.g., Bookmark) use information obtained from IRT analyses to support judges’ decisions.

This workshop introduces techniques utilising the information obtained from IRT/Rasch analyses to establish defensible pass/fail decisions without additional post-examination judgement. These techniques are also applicable when tests are multidimensional and/or apply non-compensatory rules.

In the workshop, the participants will learn how to apply Rasch models to standard setting when tests meet these criteria. The workshop will also introduce alternative defensible solutions when IRT/Rasch models are not applicable.

The facilitator, Boaz Shulruf, is the head of Medical Education Research at UNSW, Sydney, and is an expert in the topic. Shulruf has developed new methods for standard setting and has published internationally on this topic. His standard setting workshops, delivered at leading international Medical Education conferences, received excellent feedback and initiated a number of international research collaborations.

The workshop agenda:

  1. Introduction to standard setting and a brief review of current methods and knowledge
  2. Introduction to the Rasch Model and its applicability to standard setting
  3. Applying the Rasch Borderline Method (RBM) to different tests
  4. Introduction to the Objective Borderline Method (OBM) as an alternative model when RBM cannot apply
  5. Critical discussion: strengths and limitations, and summary

Equipment:  Participants will need to have access to jMetrik4 and Excel.


Mark Shermis


Though machine scoring of essays and short-form constructed responses has taken on an increasing role in both summative and formative assessment, getting access to the scoring algorithms is a challenge at best. Unlike CAT or other assessment technologies, there are no readily available commercial resources that one can purchase, and only one program is available in the public domain. This makes it difficult both to research the topic and to develop tools that might be employed in instructional settings. The session is geared towards those with assessment or research responsibilities who might be interested in employing automated essay scoring for either assessment or instruction (or both).

Relevance and usefulness to conference participants

The first portion of this workshop will focus on the capabilities of, and research involving, the six main commercial automated essay scoring engines (e-rater, Intelligent Essay Assessor, CRASE, Bookette, Project Essay Grade, and Intellimetric). The second portion of the workshop will be devoted to working with LightSide, a public-domain automated essay scoring engine, to analyze a set of essays that can be used for either formative or summative assessment.

Mark D. Shermis has worked in the area of machine scoring for over 15 years. He and co-editor Jill Burstein have published two books on the topic, Automated Essay Scoring: A Cross-Disciplinary Approach (2003) and Automated Essay Evaluation: Current Applications and New Directions (2013), which have served as primary reference material for researchers and developers in the discipline.

Equipment:  Those who attend this session are encouraged to download and install the LightSide software. Essay data will be provided for those using the software during the session. Workshop participants are expected to have their own laptops with a relatively recent version of either Windows or Mac OS.


Maria Elena Oliveri, Cathy Wendler


Determining if a test is doing what it was designed to do involves understanding consequences of test results used for decision-making. Each test score use (e.g., improving policy and practice or informing language-based selection practices) is associated with consequences for one or more stakeholders (e.g., test takers, institutions, schools, and the public). While some consequences are intended, some are unintended. And while some unintended consequences may be positive, it is more likely the case that unintended consequences negatively impact stakeholders (AERA, APA, & NCME, 2014).

Although test publishers may wish it untrue, it is likely that every test is subject to some unintended, negative consequences, including those occurring as part of test use. This includes not considering contextual factors underlying data from international assessments, which may lead to inaccurate data interpretation, thereby limiting the usefulness and meaningfulness of those data for guiding policy and practice. In recognition of this issue, The Standards for Educational and Psychological Testing (p. 19) concluded that “unintended consequences merit close examination.”

Few systematic mechanisms have been developed to help identify unintended consequences. Such approaches may require not only understanding how to detect unintended consequences but also how to create mitigation plans to alleviate unintended consequences. One emergent approach is the use of a Theory of Action (Bennett, 2010) to identify and mitigate unintended consequences.

In this workshop, participants will be presented with examples of unintended consequences, the process used to identify them, and possible solutions for addressing the consequences from Theories of Action developed across an array of assessments administered internationally. Participants will begin the development of a Theory of Action for their own assessment program. Finally, participants will be presented with other examples of mitigation plans, such as building a validity rationale, to help support the validity of their own assessment program.


Bruno Zumbo, Yan Liu


Nearly all contemporary psychometric models make use of latent variables of one form or another. An important consideration in using these models is the performance of the items in potentially heterogeneous populations of respondents. Many claims about the validity of our measures, and the inferences we make therefrom, are based on the premise that individuals interpret and respond to sets of items in a consistent manner such that the measurement model parameters are equivalently applicable to all people irrespective of any differences among them in our target population. The purpose of this workshop is to introduce latent class analysis in the context of the analysis of test data. The use of latent variable mixture models will be introduced and demonstrated in the context of examining the extent to which a sample is homogeneous with respect to a specified unidimensional model for categorical data and identifying potential sources of sample heterogeneity. In addition to test level analyses, we will also introduce a set of latent class methods recently introduced by the author to investigate heterogeneity at the item level: a new latent class item bias technique. The workshop is structured to focus on the fundamentals of the methods and demonstrate the techniques with real test data. Prior basic knowledge of structural equation modeling and confirmatory factor analysis, in particular, will be assumed.


Jacqueline Leighton


The collection of response process data is strongly recommended for tests designed to measure and inform conclusions about test-takers’ cognitive processes (Standards, 2014). However, recent research developments in think-aloud and cognitive-laboratory methods (e.g., Fox, Ericsson & Best, 2011; Leighton, 2013; Willis, 2015) indicate a need to revisit key aspects of objectives, sample considerations, interview techniques, data collection, coding and methods for analyzing and improving the quality of conclusions derived from these highly labour-intensive methodologies.

This workshop will introduce participants to key considerations in distinguishing think-aloud from cognitive laboratory methods for collecting, analysing, and interpreting response process data to better satisfy and defend test development and validation objectives.

The facilitator has over 20 years of experience conducting empirical research using think-aloud and cognitive laboratory methods, in addition to writing critical analyses about best practices using these methods. She has published her research in leading testing journals (e.g., Educational Measurement: Issues and Practice) and is currently writing a methodologically focused book on think-aloud and cognitive laboratories in the series Evaluation Research: Bridging Qualitative and Quantitative Methods, to be published by Oxford University Press.

Equipment:  Participants are expected to bring laptops to work on mini-tasks – coding, segmenting, and analysis.


Anita Hubley


The quality of the measures we use in research and clinical practice is of critical importance. The inferences we make from scores on psychological and health measures have impact on theory, knowledge, policy, and individuals’ lives. This workshop reviews basic measurement principles of reliability and validity through the lens of modern validity and The Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], and National Council on Measurement in Education [NCME], 2014). This workshop is relevant to test users, who are responsible for ensuring they have an adequate understanding of current psychometric theory and principles and use this knowledge when conducting reliability and validity studies or when using such studies to decide whether use of a measure is appropriate for their purpose, target audience, and context. This workshop will define key terms, contrast Trinitarian and modern perspectives on validity, and, based on The Standards and other recent literature, describe reliability evidence and each of the five sources of validity evidence. The workshop will address common questions about how much evidence is needed, whether all sources of validity evidence are needed for all measures, and the applicability of bodies of evidence for original and adapted/translated tests. Finally, based on reliability and validity syntheses, a summary of what evidence tends to be reported well and what does not will be presented. The workshop presentation will include examples and question/answer sessions.

The facilitator is a Full Professor in Measurement, Evaluation, and Research Methodology at the University of British Columbia and a former ITC Council member who has published over 85 refereed articles and book chapters and given over 100 conference presentations related to psychological and health measurement, assessment, validation, and test development.


Barbara Byrne


Over at least the past two decades, there has been a rapidly growing interest in cross-national comparisons. Although such comparisons traditionally have been the province of cross-cultural psychologists, a review of the extant literature reveals this investigative work now to be of substantial interest within mainstream psychology, as well as with researchers across a broad band of other disciplines. A notable outgrowth of this work has been the burgeoning number of assessment scales being translated into other languages for use in countries and cultures that differ from the one in which the original scale was developed and normed. Importantly, inherent in the conduct of multigroup comparisons across national/cultural groups are two critical assumptions: (a) that the translated assessment scale is operating equivalently across the groups; and (b) that given the hierarchical structure of multicultural data (i.e., individuals nested within countries), the scale is operating equivalently across individual and country levels. As with all statistical analyses, these assumptions must be tested. Based on several different CFA models within the framework of structural equation modeling (SEM), workshop participants will be walked through the hierarchical set of steps needed in testing both of these critically important assumptions. In broad terms, this workshop focuses on both the analytic procedures involved in testing for the equivalence of an assessment scale across national/cultural groups, and on the many and diverse issues contributing to complexities associated with these analyses.

More specifically, the purpose of this workshop is twofold: (a) to present a nonmathematical introduction to the underlying rationale and basic concepts associated with tests for multigroup and multilevel equivalence, and (b) to discuss and illustrate, by example, substantive, psychometric, and statistical issues that can impact these tests for equivalence when the groups under study represent different nations and cultures. To gain the most from this workshop, participants should have an understanding of basic SEM concepts.