Posts tagged ‘Test Development’

Assessment Maturity Model

Over the last few years I have been working with customers and friends on an Assessment Maturity Model. It started with brainstorming, was developed using a wiki, and has been tested by presenting the idea to a number of groups around the world. Now I feel that it is ready to take to the next step.

Over the last few weeks I have been formalizing the model to make it easy to understand and building a web site.

The premise behind the Assessment Maturity Model is that if you can’t measure it you can’t manage it. But to measure it you need to know what "it" is. 



The Assessment Maturity Model proposes six key performance indicators, known as Measures, tracked across three key Areas of Assessment, namely:

  • Assessment Development
    This area includes all aspects of authoring (creating and maintaining) the items and assessments.
  • Assessment Delivery
    This area includes all aspects of administering the assessment to candidates, respondents, participants, etc.
  • Presenting Results
This area deals with presenting results in a trustworthy way to stakeholders in a meaningful context.

Six key performance indicators, known as Measures, are tracked to provide an indication of maturity. These are:

  • Stakeholder Satisfaction
  • Security
  • Strategic Goals
  • Processes
  • Data Management
  • Communications with Stakeholders

By tracking these Measures an organization can determine where it is and can plan for where it wants to be. The Measures can be tracked for a single Area, as shown in the graphic, or with the three Areas combined for an overview. The graphic shows how an organization can track an Area on the dimensions of "Quality" and "Efficiency".

If you manage any type of assessment program I’d encourage you to take time to learn more about the Areas, Measures and Phases of the Assessment Maturity Model.

Please feel free to link to the Assessment Maturity Model web site, cross-link from other web sites and blogs, tweet about it, email me about it, comment here about it and/or even tell your friends about it!

Watch out for more – this isn’t finished yet!


August 14, 2009 at 12:14 am

Society for Industrial and Organizational Psychology “SIOP” discussion of the US Supreme Court case Ricci v. DeStefano

I wanted to share some well-informed points of view on testing in the workplace, along with commentary on the recent Ricci v. DeStefano case, from the SIOP web site.

SIOP is a well-respected Division of the APA with a mission to enhance human well-being and performance in work settings by promoting the science, practice, and teaching of industrial-organizational psychology.

Industrial-organizational (I-O) psychology is the scientific study of the workplace. The rigor and methods of psychology are applied to issues of critical relevance to business, including talent management, coaching, assessment, selection, training, organizational development, performance, and work-life balance.

Excerpt from article by Clif Boutelle and Stephany Schings:

With its recent narrow 5-4 decision in the Ricci v. DeStefano case, the Supreme Court ruled that results of a valid, job-related test cannot be thrown out simply because the test may result in adverse impact. The decision has underscored the importance of valid, job-related tests, and for many I-O psychologists, it has reaffirmed their status as central and vital players in the testing process.

Read the complete SIOP Article here

My original blog posting on the Ricci v. DeStefano case

July 16, 2009 at 4:26 pm

US Supreme Court ruled that results of a valid, job-related test cannot be thrown out

The US Supreme Court recently ruled that white and Latino firefighters in New Haven, Connecticut were discriminated against when the city failed to certify the results of an exam.  The full text of the ruling can be found at: 

The city followed well established guidelines and best practices for test design but became nervous when one racial group was disproportionally impacted. This resulted in a heated public debate and a lawsuit that’s travelled a rocky road all the way to the Supreme Court.

I am not a lawyer and do not wish to discuss the merits of this particular case. Dramatically smarter people have gotten involved and the Supreme Court opinion is well worth a read.

I do believe that there are some lessons we can learn from this case:

  • Valid and reliable exams assist with promoting a meritocracy.  This is very important when life, limb and livelihoods are on the line. The union’s guidelines and city processes promoted a meritocracy.
  • Job analysis to ensure that the test is aligned with the job is important and was effectively conducted in this case. As the court stated “There is no genuine dispute that the examinations were job-related and consistent with business necessity.”
  • Establishing processes and following them is important. Processes had been established and followed by the city except that when there was a dispute the city broke the process by not certifying the test results.

The city’s failure to certify the results, despite effective job analysis and an effective exam, caused this protracted legal process. Unions often get a bad rep, but in this case, as firefighters have to trust each other when their lives are on the line, it was the union that fought for a meritocracy and approved the examination processes. Hats off to them for that.

The “fairness” or “bias” of high-stakes tests is a common theme when results are challenged or appealed, so validity, reliability and cut scores require meticulous attention. Although this ruling raises the bar for challenges based on racial bias, it does not eliminate the need for employers to be vigilant about test quality. Questionmark’s white paper on defensibility addresses test reliability, validity and other essentials. It’s available for download at:

I’m happy that the US Supreme Court has ruled in favor of exams and demonstrated its appreciation for best practices in the areas of job analysis, unbiased development, expert review, peer review, and agreeing on, documenting and following processes.

June 30, 2009 at 8:15 pm

Item Analysis

I was lucky enough to be able to spend some time today with Sharon Shrock and Bill Coscarelli, authors of the book Criterion-Referenced Test Development (3rd Edition), and I was reminded of the first time that I understood the basics of item analysis. I wanted to share some knowledge that others have found useful.

The following discussion assumes a four-choice multiple choice question/item, as that makes it easier to explain and understand. However, the same logic can be applied to the “outcome” of the participant answering the question, even if the item had tens of potential outcomes. For example, a drag-and-drop item or a multiple response item would have a limited number of responses (even if that number were large) and thereby a limited number of outcomes.

Before I go off on a tangent, we’ll focus on a multiple choice item with four choices for this explanation.

Let’s imagine that we presented an item in which all of the choices were equally right or equally confusing, so all participants had to resort to guessing: we’d end up with 25% of participants selecting each choice. If there were more choices, a smaller percentage would guess the right answer. So a four-choice multiple choice question has a guessing factor of 25%.

  Choice   Respondents   Percentage
  A        1,000         25%
  B        1,000         25%
  C        1,000         25%
  D        1,000         25%

The table above illustrates this, showing the four choices (A, B, C and D), the number of respondents that selected each choice (1,000 each) and the resulting percentage calculated in the third column.
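The guessing factor described above can be sketched in a couple of lines of Python (the function name is mine, not from the original post):

```python
# With k equally attractive choices, pure guessing gives each choice
# a 1/k share of responses -- the "guessing factor".
def guessing_factor(num_choices):
    return 1 / num_choices

print(guessing_factor(4))  # 0.25, i.e. 25%
```

With five choices the factor drops to 20%, which is why adding plausible distractors makes guessing less rewarding.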

In the real world we’d hope that people don’t have to resort to guessing (although some do), and so we might see results more like this:


These results were taken from an actual item delivered in a test by a Questionmark customer and provided for me to use as an example.  The right choice was C and I’ve highlighted this to make it easy to follow.

From the percentage we can determine the Difficulty, or “P value”, which is simply the percentage of people selecting the right choice expressed as a real number; so 64% becomes 0.64.
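As a minimal sketch, the P value is just the correct choice’s share of all responses. The counts below are illustrative (the original post’s table isn’t reproduced here); they are simply chosen so that choice C comes out at 64%:

```python
# Hypothetical response counts for a four-choice item; C is the key.
responses = {"A": 400, "B": 600, "C": 3200, "D": 800}
correct_choice = "C"

total = sum(responses.values())
p_value = responses[correct_choice] / total  # Difficulty, a.k.a. "P value"
print(p_value)  # 0.64
```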


Now let’s imagine that everyone selected the right choice: we’d have 100% of people selecting it, which would result in a Difficulty of 1.0. An item that everyone gets right is not particularly helpful, and we might consider dropping it from our tests as it would not seem to discriminate between the more and less competent/knowledgeable. Conversely, if no one selected the right choice (i.e. everyone got it wrong) we’d have a Difficulty of 0.0, which would tell us the question is too hard.

Okay, so we need a Difficulty above 0 and below 1; in our example it is 0.64, and most often good test items yield Difficulty levels above 0.6 and below 0.9.
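Those thresholds can be turned into a simple screening rule. The cut-offs below are the rules of thumb from this discussion, not a standard, and the function is my own sketch:

```python
def flag_difficulty(p):
    """Rough screen for an item's P value using the rules of thumb above:
    good items usually land between 0.6 and 0.9; 0.0 and 1.0 are degenerate."""
    if p == 1.0:
        return "everyone correct - item does not discriminate"
    if p == 0.0:
        return "no one correct - item too hard"
    if 0.6 < p < 0.9:
        return "within the typical range for good items"
    return "outside the typical range - review the item"

print(flag_difficulty(0.64))  # within the typical range for good items
```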

So now we can calculate the Participant Mean, which is the average test score of all of the Participants that selected a given choice. In our example, all the Participants that selected choice “A” had their final test scores averaged, and we determined that the mean was 35%. We can calculate this statistic for each choice, and in our example the highest Participant Mean is derived from the people selecting choice C, the correct choice; very reassuring.
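A sketch of the Participant Mean calculation, using made-up (choice, final test score) records since the real data behind the post’s table isn’t shown; the figures are picked so that choice A averages 35%:

```python
from collections import defaultdict

# Illustrative records of (choice selected, final test score in %).
records = [("A", 30), ("A", 40), ("B", 50), ("C", 70), ("C", 75), ("C", 80), ("D", 45)]

scores_by_choice = defaultdict(list)
for choice, score in records:
    scores_by_choice[choice].append(score)

# Participant Mean: average final test score of everyone who picked a choice.
participant_mean = {c: sum(s) / len(s) for c, s in scores_by_choice.items()}
print(participant_mean["A"])  # 35.0
print(participant_mean["C"])  # 75.0 -- the correct choice has the highest mean
```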


It is also reassuring that Participant Mean has nothing to do with how mean the Participants are!

Now things get a little more interesting. 

We can calculate the Outcome Discrimination, which is a number from -1 to +1. It is calculated from the contrast between the upper and lower scoring groups that selected a choice. A value of -1 would indicate an extremely bad item, as only people that failed the test selected this choice, whereas +1 would be a perfect Outcome Discrimination, indicating that all of the upper group and none of the lower group selected this choice.
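One common textbook way to compute an upper-lower discrimination index (the post doesn’t give the exact formula Questionmark uses, so treat this as an assumption) is the proportion of the upper-scoring group selecting the choice minus the proportion of the lower-scoring group doing so:

```python
def outcome_discrimination(scores, selected, fraction=0.27):
    """Upper-lower discrimination index for one choice.

    scores   -- each participant's total test score
    selected -- True where that participant picked the choice in question
    Compares the top and bottom `fraction` of scorers (27% is a common
    convention); the result ranges from -1 to +1.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = max(1, int(len(scores) * fraction))
    lower, upper = order[:n], order[-n:]
    p_upper = sum(selected[i] for i in upper) / n
    p_lower = sum(selected[i] for i in lower) / n
    return p_upper - p_lower

# Toy data: the four highest scorers picked this choice, no one else did.
scores = [90, 80, 70, 60, 50, 40, 30, 20, 10, 5]
selected = [True, True, True, True, False, False, False, False, False, False]
print(outcome_discrimination(scores, selected))  # 1.0
```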


Outcome Correlation is similar to Outcome Discrimination but involves a more extensive calculation. Just as with Outcome Discrimination, -1 is bad for the right choice but good for an incorrect choice.
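A widely used statistic of this kind is the point-biserial correlation between selecting a choice (coded 0/1) and the total test score. The post doesn’t spell out the exact calculation behind its Outcome Correlation, so this is a sketch of the standard formula, not a reproduction of Questionmark’s:

```python
import math

def outcome_correlation(scores, selected):
    """Point-biserial correlation between picking a choice and total score."""
    n = len(scores)
    x = [1.0 if s else 0.0 for s in selected]
    mx = sum(x) / n
    my = sum(scores) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, scores)) / n
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = math.sqrt(sum((yi - my) ** 2 for yi in scores) / n)
    return cov / (sx * sy)

# Higher scorers tend to pick this choice, so the correlation is strongly positive.
r = outcome_correlation([90, 80, 70, 60, 50, 40], [True, True, True, False, False, False])
print(round(r, 2))  # 0.88
```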


You always look for Outcome Discrimination and Outcome Correlation to be positive, and typically you look for values above +0.3, but your rigor and concern over the statistical properties of a test item will depend upon whether it is being used for low-, medium- and/or high-stakes assessments.

Well, I hope that you found that useful. If you want to know more, please check out Greg Pope’s blog posting on Psychometrics 101 or buy (and read) Sharon and Bill’s book!

April 7, 2009 at 2:00 am

Criterion-Referenced Test Development 3rd Edition


Criterion-Referenced Test Development 3rd Edition
by William Coscarelli (Author), Patricia Eyres (Author), Sharon Shrock (Author)

Here is the link to Criterion-Referenced Test Development 3rd Edition on Amazon. I don’t earn any money from this; I just like to share the knowledge.

April 2, 2009 at 10:12 pm

Tests That Work: Designing and Delivering Fair and Practical Measurement Tools in the Workplace

Tests That Work: Designing and Delivering Fair and Practical Measurement Tools in the Workplace

by Odin Westgaard

March 19, 2009 at 11:52 pm
