TESOL test design and administration is no easy task. Ask anyone who has tried. Writing tests takes practice, and tests require constant revision. This part of our series on testing in TESOL focuses on some issues to be aware of when assessing your assessment.


Bias can show up in testing in a variety of ways. One form you may not have considered is test bias, which comes from the format of the test itself. Test formats vary around the world, and some students may not be familiar with the format of your test. Those who have never taken a multiple-choice test or an extensive oral interview may not perform as well as those who are familiar with the format, despite having similar skill levels.

Another form of bias that can be found within a test is cultural bias. If your test is packed with obscure pop culture references or American football statistics, your Bhutanese refugees will likely be confused.

A final form of bias has nothing to do with the test, but rather with the examiner. Is it possible for an examiner to score one student lower than another on a performance test because the examiner has mixed feelings about the learner? Yes, it’s possible. Conversely, an examiner might score a learner higher if the examiner likes the learner. Yes, that’s also possible.

In fact, without a rubric, authentic assessment is subjective and leaves room for bias. Therefore, it is highly recommended that you develop a rubric for your tests that outlines specific criteria for scoring. It is also recommended that you follow that rubric closely when determining scores.


Generally, reliability refers to whether a test produces consistent results when it is repeated. I am going to introduce two other types of reliability that will also help you ensure that you are eliminating bias from your authentic assessment.

Inter-rater reliability requires at least two examiners to score a test. The assessments from both raters are compared to ensure that the scores are consistent. If there are differences in the scores, the examiners reevaluate the test. If only one examiner is available, you can still use intra-rater reliability. Here’s what you do. Grade the test, wait a day or two, and then grade it again. Do you get the same results?


Last, but certainly not least, is validity. In fact, this may be the most important question you ask yourself: does the test assess what I want it to? Simply put, a test that intends to assess writing skills but only requires speaking is not valid. There are numerous forms of validity that I encourage you to research, but there is one I want to touch on here: external validity.

Given that your test is indeed internally valid, an externally valid test will correlate with performance in other cases. For example, if a test says that a student will be able to function at a university with their current proficiency, and that student is in fact successful, then the test could be considered externally valid. If the test predicts the student is proficient enough, but reality says otherwise, the test is externally invalid.
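One rough way to check this, if you can follow up with your students later, is to see how strongly test scores correlate with a real-world outcome. The sketch below computes a Pearson correlation by hand. The data are invented for illustration, and using first-semester GPA as the outcome is my assumption; any outcome that matters for your learners could take its place. With a handful of students, treat the result as a hint rather than proof.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need two lists of the same length with at least two values.")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("Correlation is undefined when all values in a list are identical.")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical data: proficiency test scores and the same students'
# first-semester GPA at university.
test_scores = [55, 60, 70, 80, 90]
first_semester_gpa = [2.0, 2.4, 2.9, 3.3, 3.8]

r = pearson_correlation(test_scores, first_semester_gpa)
print("Correlation between test score and later GPA:", round(r, 2))
```

A value close to 1 suggests that students who scored higher on the test also tended to do better later on, which is what you would hope to see from an externally valid test. A value near 0 suggests the test is telling you little about real-world performance.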

Why have I chosen to highlight this particular aspect of validity? In the previous posts of this series, I encouraged you to consider tests that closely reflect real-world tasks that the student would encounter. If that is indeed your aim when designing your test, then external validity is your measure of success.