Posts tagged ‘Item Analysis’

Fidelity of an Assessment

From the calculation shown above, you can see that a worthy goal of creating, delivering and reporting on assessments is to minimize the Error of Measurement. However, this competes with another worthy goal: ensuring that assessments are affordable.

The more closely we can simulate the actual performance environment, and the actual knowledge, skill and/or ability to be used, the smaller the Error of Measurement. So if we were assessing your driving performance, we should put you in a car and go driving with you.

Simple enough! Driving without crashing is a worthy goal but hardly the bar that we should set for our driving tests!  We need to ask ourselves what real performance looks like, what behaviors can predict good performance, and what the appropriate ways are to measure them.

So what are the performance characteristics of a good driver?  Good eyesight, the ability to control the car, spatial awareness, understanding of and obedience to road signs, signaling intent to others, etc.  It might be better to assess some of these attributes before we jump in a car!  So let’s determine the criteria for successful performance and then assess each attribute (sight, road signs, rules of the road, etc.) with one form of assessment before we start witnessing someone’s actual performance.

We would start by determining the knowledge, skills and abilities required to perform a task in a real-world situation.  Then we would select appropriate assessments, starting with less expensive ones and progressing closer to real-world environments.

Now the question is, how do we assess someone’s potential performance within a low cost simulated environment?  The key is to place someone into the context of the performance environment. We can do this by using scenario questions with the stimulus that we can afford/justify:

  • Low cost
    Tell a story about a situation and ask questions related to that story.  “You travelled to Morocco with your friend who rented a car and you came upon a road sign blah, blah, blah. When would you turn left?”
  • Medium cost
    Produce pictures and sounds and let them tell the story. “Please follow along with the pictures and answer the question: when would you turn left?”
  • High Cost
    Show a video that simulates driving. The person could either interact with the simulation or answer multiple questions about the video and/or simulation as they progress through the experience.

The closer the simulation is to the real world (sounds, smell, sight, danger, etc), the more accurate the measurement.  When we simulate something to 100% we are in the real world!

In certain situations we want to measure your performance whilst your adrenaline levels are high or when your fight or flight mechanisms are kicking in. To do that we have to raise the fidelity of the simulation without causing harm.

For instance, suppose we want to learn how you would react in a crash.  We can’t go around having you crash things as that would be a danger to yourself and others. However, by using low and high fidelity simulations we could produce pretty accurate predictors on how you would act during a crash. 

The picture above tries to illustrate this idea of fidelity. The higher the fidelity of the stimulus, the more accurate the measurement can be.  However, maybe we don’t need, or can’t afford, high fidelity. So we have some options:

  • Text stimulus

    Advantages: Easy, inexpensive, and appropriate for many situations.
    Disadvantages: Might inappropriately assess reading skills. Rarely stimulates the body to react with fight-or-flight mechanisms, which might cause a different outcome in real life than during an assessment.

  • Picture stimulus

    Advantages: Easy, inexpensive, and images can convey real-world situations. Can focus on assessing the topic rather than also assessing language skills.
    Disadvantages: Rarely stimulates the body to react with fight-or-flight mechanisms.

  • Video stimulus

    Advantages: Videos can repeatedly convey real-world situations. Can focus on testing the topic rather than also assessing language skills. Can stimulate more emotional reactions.
    Disadvantages: More expensive to produce.

  • Interactive stimulus (Gaming)

    Advantages: Games can convey real situations. Can focus on testing the topic rather than also assessing language skills. Can stimulate more emotional reactions.
    Disadvantages: More expensive to produce.

I hope that you find this helpful.


May 20, 2009 at 6:13 pm

A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives



by Lorin W. Anderson, David R. Krathwohl, Peter W. Airasian, Kathleen A. Cruikshank, Richard E. Mayer, Paul R. Pintrich, James Raths, and Merlin C. Wittrock

Here is the link to A Taxonomy for Learning, Teaching, and Assessing. I don’t earn any money from this; I just like to share the knowledge.

May 18, 2009 at 12:15 pm

Item Analysis

I was lucky enough to be able to spend some time today with Sharon Shrock and Bill Coscarelli, authors of the book Criterion-Referenced Test Development (3rd Edition) and I was reminded of the first time that I understood the basics of item analysis.  I wanted to share some knowledge with you that others have found useful.

The following discussion assumes a four-choice multiple choice question/item, as that makes it easier to explain and understand. However, the same analysis can be applied to the “outcome” of a participant answering any question, even if the item had tens of potential outcomes.  For example, with a drag-and-drop item or a multiple response item, there would be a limited number of responses (even if that number were large) and thereby a limited number of outcomes.

Before I go off on a tangent we’ll focus on a multiple choice item with four choices for this explanation.

Let’s imagine that we presented an item where all of the answers were equally right or equally confusing and all participants had to resort to guessing; we’d end up with 25% of participants selecting each choice.  If there were more choices, a smaller percentage would guess the right answer.  So a four-choice multiple choice question has a guessing factor of 25%.


The table above illustrates this, showing the four choices (A, B, C and D), the number of respondents that selected each choice (1,000 selecting each) and the resulting percentage calculated in the third column.

In the real world we’d hope that people don’t have to resort to guessing (although some people have to), and so we might see results more like this:


These results were taken from an actual item delivered in a test by a Questionmark customer and provided for me to use as an example.  The right choice was C and I’ve highlighted this to make it easy to follow.

From the percentages we can determine the Difficulty or “P value”, which is simply the percentage of people selecting the right choice expressed as a real number; so 64% becomes 0.64.


Now let’s imagine that everyone selected the right choice: we’d have 100% of people selecting it, which would result in a Difficulty of 1.0. An item that everyone gets right is not particularly helpful, and we might consider dropping it from our tests as it would not seem to discriminate between the more and less competent/knowledgeable. Conversely, if no one selected the right choice (i.e. everyone got it wrong) we’d have a Difficulty of 0.0, which would tell us the question is too hard.

Okay, so we need a Difficulty above 0 and below 1; in our example it is 0.64, and good test items most often yield Difficulty levels above 0.6 and below 0.9.
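Tools like Questionmark report this statistic for you, but the arithmetic is simple enough to sketch in a few lines of Python. The counts below are invented to match the 64% figure above (they are not the actual customer data):

```python
def difficulty(counts, correct):
    """Difficulty (P value): fraction of respondents picking the correct choice."""
    return counts[correct] / sum(counts.values())

# Hypothetical response counts chosen to match the 64% example; C is correct.
counts = {"A": 120, "B": 180, "C": 640, "D": 60}

p = difficulty(counts, "C")
print(p)  # 0.64

# The all-guessing case from the first table: 1,000 per choice gives 0.25.
print(difficulty({"A": 1000, "B": 1000, "C": 1000, "D": 1000}, "C"))  # 0.25
```

The same calculation works for any item type once you know which outcome counts as “correct”.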

So now we can calculate the Participant Mean, which is the average test score of all of the Participants that selected a given choice.  In our example, all the Participants that selected choice “A” had their final test scores averaged, and we determined that the mean was 35%. We can calculate this statistic for each choice, and in our example we can see that the highest Participant Mean is derived from the people selecting choice C, the correct choice; very reassuring.


It is also reassuring that the Participant Mean has nothing to do with how mean the Participants are!
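The arithmetic can be sketched like this; the score lists are invented for illustration (only choice A’s mean of 35% echoes the example above):

```python
def participant_means(scores_by_choice):
    """Mean final-test score of the participants who selected each choice."""
    return {choice: sum(s) / len(s) for choice, s in scores_by_choice.items()}

# Hypothetical final test scores (in percent), grouped by the choice selected.
scores_by_choice = {
    "A": [30, 40, 35],        # mean 35, as in the example
    "B": [45, 55],
    "C": [70, 80, 75, 85],    # the correct choice
    "D": [25, 35],
}

means = participant_means(scores_by_choice)
print(means)  # {'A': 35.0, 'B': 50.0, 'C': 77.5, 'D': 30.0}

# Reassuring: the highest mean belongs to the correct choice, C.
assert max(means, key=means.get) == "C"
```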

Now things get a little more interesting. 

We can calculate the Outcome Discrimination, which is a number from –1 to +1.  It is calculated from the contrast between the upper and lower scoring groups that selected a choice.  A –1 value would indicate that this is an extremely bad item, as only people that failed the test selected this choice.  Whereas a +1 would indicate perfect Outcome Discrimination, meaning that everyone in the upper group selected this choice.


Outcome Correlation is similar to Outcome Discrimination but involves a more extensive calculation.  Just as with Outcome Discrimination, –1 is bad for the right choice but good for an incorrect choice.


You always look for Outcome Discrimination and Outcome Correlation to be positive, and typically you look for values above +0.3, but your rigor and concern over the statistical properties of a test item will depend upon whether it is being used for low, medium and/or high stakes assessments.
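Item analysis tools compute these statistics for you, but the underlying calculations can be sketched as below. The 27% upper/lower split and the eight-participant sample are my own illustrative choices, and the correlation shown is the point-biserial form commonly used for outcome correlation:

```python
def outcome_discrimination(chose, scores, frac=0.27):
    """Upper-lower discrimination in [-1, +1] for one outcome.

    chose  -- 1/0 flags: did each participant select this choice?
    scores -- each participant's total test score
    frac   -- fraction of participants forming the upper and lower groups
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = max(1, int(len(scores) * frac))
    lower, upper = order[:n], order[-n:]
    return sum(chose[i] for i in upper) / n - sum(chose[i] for i in lower) / n

def outcome_correlation(chose, scores):
    """Point-biserial correlation between selecting the choice and total score."""
    n = len(scores)
    mc, ms = sum(chose) / n, sum(scores) / n
    cov = sum((c - mc) * (s - ms) for c, s in zip(chose, scores)) / n
    sc = (sum((c - mc) ** 2 for c in chose) / n) ** 0.5
    ss = (sum((s - ms) ** 2 for s in scores) / n) ** 0.5
    return cov / (sc * ss)

# Eight hypothetical participants: the four high scorers all picked this
# (correct) choice, the four low scorers did not.
scores = [90, 85, 80, 75, 40, 35, 30, 25]
chose = [1, 1, 1, 1, 0, 0, 0, 0]

print(outcome_discrimination(chose, scores))          # 1.0
print(round(outcome_correlation(chose, scores), 2))   # 0.98
```

Both values are strongly positive, well above the +0.3 rule of thumb, which is what you would hope to see for a correct choice.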

Well I hope that you found that useful and if you want to know more please check out Greg Pope’s blog posting on Psychometrics 101 or buy (and read) Sharon and Bill’s book!

April 7, 2009 at 2:00 am

Criterion-Referenced Test Development 3rd Edition


by William Coscarelli (Author), Patricia Eyres (Author), Sharon Shrock (Author)

Here is the link to Criterion-Referenced Test Development 3rd Edition on Amazon.  I don’t earn any money from this; I just like to share the knowledge.

April 2, 2009 at 10:12 pm

Types Of Assessments (Formative, Diagnostic, Summative, and Surveys)

I was working with Dr. Will Thalheimer of Work-Learning Research a few days ago on an interview-style video around learning, e-learning, assessments and the macro economy’s impact on learning and assessment.  It was fun and I learned a bunch about the challenges that this type of production entails. I’ll provide some links out to this when I get them from Will.

My discussions with Will, an expert in effective learning interventions, learning environments and the challenges of the forgetting curve, prompted me to think more deeply about how we distinguish different styles of assessments. So I thought it might be useful to share some distinctions with you on the types of assessments that we use during the process of learning.

Formative Assessments

My definition: Formative Assessments (quizzes and practice tests) are used to strengthen memory recall through practice, to correct misconceptions, and to promote confidence in one’s knowledge.


In the learning process we are trying to transfer knowledge and skills to a person’s memory so that they become competent to perform a task. During that process people might fail to pay attention, fail to grasp everything taught, or simply forget things they once knew. Most learning environments use simple Formative questions as they can:

  1. Create intrigue in order to create a learning moment that motivates the learner to pay attention.
  2. Focus the learner’s attention on the importance of key topics.
  3. Reduce the forgetting curve; by recalling previous knowledge and skills we strengthen the ability to recall that knowledge or skill.
  4. Correct misconceptions where someone has formed invalid connections, although that borders on the purpose of a Diagnostic Assessment.

Typically a Formative Assessment does not need to store results, as the job of the assessment is completed by providing the stimulus that causes the memory ‘muscle’ to be strengthened, just as lifting a weight in a gym strengthens other muscles. Sometimes results are stored in order to track how instruction might be improved.

Diagnostic Assessments

When we visit our doctor, we’d become concerned if our doctor prescribed pills without asking us any questions. Doctors typically ask where the pain is located, when the pain happens, and whether the pain is associated with certain activities. The doctor might run tests on our blood or other bodily fluids (ugh)! And so it is with Diagnostic Assessments.

First we seek to understand the current knowledge, skill, and/or ability of the Participant so that we can diagnose the gap and thereby provide a prescription for learning if required.

Diagnostic questions might be self-assessment style questions such as “Please rate your ability to ….” or test questions such as “Which savings plan would you suggest to a married man of 42 with 3 kids, a dog and a large mortgage?”.  Either way, the goal is to match the responses to the benchmark required in order to diagnose the gaps and prescribe something useful.

Diagnostic Assessments can be used to direct people to the right learning experience such as a class, conversation with a Subject Matter Expert (SME), a web search, a book, an elearning course, etc.

Diagnostic Assessments are not designed to strengthen memory recall; however, by their very nature they do provide some of those characteristics.

Summative Assessments

Tests and exams designed to measure knowledge, skills, and abilities are known as Summative Assessments.  These are typically used to certify that people have a certain level of knowledge, skill, and/or ability. Often these certifications grant people access to something previously not permitted, such as a license to drive, a promotion within an organization, or physical access to dangerous materials.  Because of this “Grant of Access”, Summative Assessments are typically Higher Stakes assessments.  Typically Summative Assessments have “Pass” and “Fail” associated with them, which distinguishes them from Formative and Diagnostic Assessments, which don’t.  There are two basic types of Summative Assessments:

  1. Norm Referenced

    Where a Participant’s pass is determined by their position within a group of test takers.  The Participants’ results are compared to the others in the group after everyone has completed the assessment. This is often used in environments where the number of places in the next course or job role is limited, so only a certain number of people can pass.  A Norm Referenced test will tease out the best people within the group that took that test, but the quality of competence passing will vary from one sitting of the test to the next.

  2. Criterion Referenced

    When the criterion for passing a test has been predetermined, it is known as a Criterion-Referenced Assessment.  The most used Criterion-Referenced Assessment on the planet is the driving test. The criteria for passing have been determined prior to the test, and you normally know whether you have passed or failed immediately; well, you certainly could, although sometimes administrative processes slow things up.

Criterion-Referenced Assessments are often used for regulatory compliance tests, HIPAA compliance, pre-employment tests, and IT certification exams.


Surveys

We’ve all completed surveys, and we all recognize that results have to be stored and aggregated to help with the analytics. A common question type used within surveys is the Likert scale item, developed by Rensis Likert, an American educator and organizational psychologist. Likert scales prompt a Participant with a statement, and they respond by specifying their level of agreement, which is then transposed to a number to ease the measurement and analytics process.
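As a small sketch of that transposition (the five-point labels and 1–5 coding here are one common convention, not a fixed standard):

```python
# One common 5-point agreement scale; the coding is illustrative.
LIKERT = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neither agree nor disagree": 3,
    "Agree": 4,
    "Strongly agree": 5,
}

responses = ["Agree", "Strongly agree", "Neither agree nor disagree", "Agree"]
coded = [LIKERT[r] for r in responses]

print(coded)                    # [4, 5, 3, 4]
print(sum(coded) / len(coded))  # 4.0  (mean agreement with this statement)
```

Once responses are numbers, the usual aggregation and analytics (means, distributions, comparisons across groups) become straightforward.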

Course Evaluations are the most commonly used survey type within the Learning Process, but other survey types include Job Task Analysis, Needs Analysis, 360 and other forms of peer review assessments, Employee Attitude, Customer Satisfaction, Partner Satisfaction and Political Opinion surveys.

Each form of Assessment has its own set of challenges for developing and maintaining the instrument, for delivering questions, providing feedback to the participants and stakeholders, and then performing the analysis of results; but that will have to wait for another blog entry!

March 22, 2009 at 2:50 pm

Distinguishing Low, Medium and High Stakes Assessments

Trying to classify an assessment into a Low, Medium or High stakes category has some pros and cons.  On the plus side, we can quickly “range” the time to create, deliver and report on an assessment, range the impact on the person taking the assessment, and it can help conversations about an assessment.  All conversations have to be considered in context.  If you were having the conversation in the context of Psychometrics, you would probably be distinguishing between low stakes and high stakes exams; if you were having the conversation with instructors and trainers, you might be distinguishing between assessments prior to a course, during a learning experience, tests afterwards, and course evaluations.


This chart can help us think about Low, Medium and High Stakes assessments in both contexts but without a specific measure. Essentially it aids the conversation by promoting some distinctions and a vocabulary rather than providing a measurable outcome. That is in itself amusing, because the world of assessments is all about the theory and technique of educational and psychological measurement.

In higher stakes assessments we tend to talk about candidates; in medium and lower stakes assessments we talk about students, employees and/or learners; and in low stakes assessments we might talk about respondents. And so in our vocabulary, in the context of low, medium and high stakes assessments, we’ll talk about Participants, who are the people that answer the questions in our assessments; they participate.

There are six terms that can help us provide some distinctions:

  1. Consequences to the Participant

    If the consequences to the Participant are low, that helps us classify it as a Low Stakes assessment, but if the consequences are great (affecting Lives, Limbs, and/or Livelihoods) then it would be a High Stakes assessment.

  2. Legal Liabilities

    When stakes are high, consequences are high, and so in come the lawyers. I don’t want to turn this into a political debate, but laws are written to protect our rights, and lawyers help us understand the laws so we do the right thing. All stakeholders in the assessment process have rights and responsibilities. Often this debate is taken from the side of the Participant, but other Stakeholders have rights too. Assessments must be fair, reliable and fit for purpose. And we can’t go around certifying people who aren’t qualified to fix gas leaks.

  3. Proctoring/Invigilation

    When the stakes are high, people are more motivated to cheat, which requires that the assessment process be invigilated to prevent this and to promote trust in the assessment process.  As I fly around the world, I would like to know that my pilot and the air traffic controllers didn’t cheat on their exams. With certain kinds of low stakes assessments, invigilation would provide an unwanted and undesirable level of supervision. Here are some examples of assessments that we should probably not proctor/invigilate:

    1. A novice student taking an assessment where the goal is to provoke intrigue before a learning experience.
    2. A course evaluation where the moderator might choose to influence the outcome.
  4. Validity and Reliability

    Ideally all assessments should be reliable and valid.  They should work consistently over time and they should align with the subject matter that you are assessing. However, if we applied the same standards to low, medium and high stakes assessments, we might never justify the costs for, say, Formative assessments or Course Evaluations.  We must always work ethically, but we don’t need a $25,000 study on the validity and reliability of every assessment. When conducting a High Stakes test or exam, we need to be sure that the assessment aligns with the topics it is assessing and that it is fair to all participants.

  5. Planning

    If you are administering a High Stakes test or exam to 5,000 Participants, you’ll use a different plan than for 10 people evaluating a course. Planning will include considering how you’ll develop your assessment, how you’ll have expert(s) review it, how you’ll maintain confidentiality and security, how you’ll deliver the assessment, how you’ll provide accommodations for those with special needs, and how you’ll report the results to the Participants and Stakeholders.

  6. Psychometrician Involvement

    A Psychometrician is a Psychometrics professional familiar with the theory and techniques to measure knowledge, skills, abilities, attitudes, and personality traits.  Psychometricians are often involved with the development of High Stakes assessments such as tests and exams, and analyze the results to ensure that the assessments are valid and performing consistently and reliably.

It is worth reminding you that all assessments can be very valuable, within their context, regardless of how you categorize them.  Just because quizzes, designed to strengthen memory recall, and course evaluations, designed to measure in order to improve the learner’s environment, are low stakes does not mean that they provide low value. But those types of distinctions will follow in another blog entry!

March 22, 2009 at 1:26 pm
