Nine Easy Methods for Screening Invalid Responses, and Their Implications for Survey Research Design
Key takeaways: use survey design (inserting dedicated screening items) and statistical analysis (e.g., response time, longstring, outlier analysis, response consistency, the R package careless) to identify invalid responses with little effort.
Introduction: With the spread of the Internet and smartphones, traditional paper-and-pencil measures are gradually being replaced by online surveys. Thanks to their convenience, online surveys have greatly benefited both researchers and participants. A 2018 Journal of Management (JOM) article notes that more and more researchers are collecting data through online surveys, and the number of published articles based on such data grows year by year (see Porter, Outlaw, Gale, & Cho, 2018).
Many of these articles appear in the field's leading journals: AMJ (53), ASQ (12), and JAP with more than 130.
Porter, C. O. L. H., Outlaw, R., Gale, J. P., & Cho, T. S. (2018). The use of online panel data in management research: A review and recommendations. Journal of Management. Advance online publication. https://doi.org/10.1177/0149206318811569
However, the biggest problem with online surveys is that researchers cannot directly observe their participants and have no way of knowing whether they answered conscientiously. You can mark every item as required, which eliminates "missing values," but you still cannot tell whether participants actually read the instructions and items; quite possibly they just filled things in at random.
You can also control response time, for example by requiring a minimum number of seconds per page. But our participants are clever too: they will not simply type 5555555555555... in an obvious loop. What about 123411234? Even if you can spot fifty 5s or twenty repetitions of 1234 at a glance, with 500 questionnaires are you going to inspect them one by one?
Others may enter 345345345, or switch digits now and then with a few different values in between. Such completely meaningless responses, from participants who never read the items, are useless for your research. Data screening is therefore essential.
Porter et al.'s (2018) JOM article on using online panels likewise explicitly recommends data screening and attention checks (see the screenshot below).
(Screenshot: Porter et al., 2018, JOM)
It is safe to predict that as online data become more common, review and publication will pay increasing attention to data screening. So it pays to learn these methods early: even if their effectiveness is debatable (I remain agnostic), at least do not give reviewers an easy target on data screening; they also add some rigor to our research designs. This article introduces nine common screening methods and shows how to implement them with statistical software (rather than by eye).
One note: the methods below are drawn from five core references. If I have missed better approaches, please share...
Curran, P. G. (2016). Methods for the detection of carelessly invalid responses in survey data. Journal of Experimental Social Psychology, 66, 4–19. https://doi.org/10.1016/j.jesp.2015.07.006
DeSimone, J. A., Harms, P. D., & DeSimone, A. J. (2015). Best practice recommendations for data screening. Journal of Organizational Behavior, 36(2), 171–181. https://doi.org/10.1002/job.1962
DeSimone, J. A., & Harms, P. D. (2018). Dirty data: The effects of screening respondents who provide low-quality data in survey research. Journal of Business and Psychology, 33(5), 559–577. https://doi.org/10.1007/s10869-017-9514-9
Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., & DeShon, R. P. (2012). Detecting and deterring insufficient effort responding to surveys. Journal of Business and Psychology, 27(1), 99–114. https://doi.org/10.1007/s10869-011-9231-8
Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17(3), 437–455. https://doi.org/10.1037/a0028085
Before turning to specific screening methods, let's first look at which factors influence whether participants respond conscientiously. Meade and Craig (2012), writing in Psychological Methods, identify four factors influencing careless responding:
· Interest: whether the participant is interested in the study. (Many psychology studies have undergraduates complete surveys merely for course credit, so genuine interest is doubtful.) The countermeasure is to use incentives, such as money...
· Survey length: avoid overly long surveys. Put yourself in the participants' shoes: the longer the survey, the more impatient people get and the more likely careless answers become.
· Social contact: whether participant and researcher can interact face to face. Online surveys obviously make this difficult.
· Environmental distraction: a participant may be on the subway, bored on the commute, casually filling in a survey to earn dinner money on the side... External distractions inevitably hurt response quality, which is why psychology experiments are run in the lab, in a small room containing nothing but a computer, to minimize interference.
Implications for research design:
1. Using longer scales
2. Using both positive and negative wording
3. If possible observe the participants & time their response (e.g., online)
(DeSimone, Harms, & DeSimone, 2015)
You have paid the participant fee, you cannot meet your participants, and you have kept the survey to about 100 items. The next question, then, is how to screen out careless respondents. Broadly, there are two approaches: insert screening items at the design stage, or identify and filter cases statistically after data collection. Generally two types: study design and post-hoc analysis (Meade & Craig, 2012). The three main unobtrusive methods are: recording response time (Behrend, Sharek, Meade, & Wiebe, 2011; Berry et al., 1992), the number of consecutive identical responses provided by the respondent ("longstring"; Behrend et al., 2011; Huang et al., 2012; Meade & Craig, 2012), and individual response variability (IRV; Dunn, Heggestad, Shanock, & Nels, in press).
(DeSimone et al., 2015)
Note: to make these methods easier to use in your own research, I quote the original English sources and cite the references throughout.
1. Survey design
Survey design: insert specific screening items to identify whether participants are responding conscientiously.
Method 1: Bogus items
Bogus items contain content that is either obvious or ridiculous.
These are statements that are obviously false or absurd; a participant who endorses them probably did not read the item carefully.
bogus items (“I was born on February 30” and “I have exactly 354 best friends”)
“I have 17 fingers on my left hand” and “I was born on planet Earth.” (DeSimone et al., 2015)
Source: Meade & Craig (2012)
(e.g. “I have never used a computer”; Huang et al., 2014).
1. I have never used a mobile phone (Huang et al., 2014).
2. I am paid biweekly by leprechauns (Meade & Craig, 2012).
3. All my friends say I would make a great poodle (Meade & Craig, 2012).
On a 7-point scale (1 = strongly disagree, 7 = strongly agree), correct answers to “I am paid biweekly by leprechauns” may be ‘strongly disagree’ and ‘disagree.’ Incorrect answers may include ‘slightly disagree,’ ‘neither agree nor disagree,’ and all of the mirrored ‘agree’ options. Items can also be written in the opposite direction, where ‘agree’ would indicate attentiveness (Curran, 2016).
On this item, careless respondents are likely to choose: neither agree nor disagree, neutral, agree, or strongly agree.
Warning: this method carries some risk. Participants who find such items amusing, or fancy themselves humorous, may deliberately choose "agree"...
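Scoring such items can be automated. The sketch below is a hypothetical illustration (the participant IDs, responses, and function name are made up): following Curran's (2016) scoring above, only the two "disagree" options on a 7-point scale count as passing the leprechaun item.

```python
def failed_bogus_check(response, passing=(1, 2)):
    """True if a bogus-item response falls outside the passing options.

    Assumes a 7-point scale where 1 = strongly disagree, 2 = disagree.
    """
    return response not in passing

# Made-up responses to "I am paid biweekly by leprechauns"
responses = {"p01": 1, "p02": 5, "p03": 2, "p04": 7}
flagged = [pid for pid, r in sorted(responses.items()) if failed_bogus_check(r)]
print(flagged)  # ['p02', 'p04']
```

The same scorer works for instructed items by setting `passing` to the single instructed option.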
Method 2: Instructed items
As the name suggests, these items instruct the respondent to choose a specific option.
One common and fairly transparent technique instructs respondents to answer with a particular response option, such as “Please select Moderately Inaccurate for this item” (Huang et al., 2012).
Example: For this item, please select C.
(“Please indicate option [X] for this question”) (DeSimone, Harms, & DeSimone, 2015)
(e.g., “To monitor quality, please respond with a two for this item”).
Or ask respondents to skip an item: “Please leave this item blank” (DeSimone et al., 2015).
Since participants' attention fluctuates over the course of a survey, it is advisable to insert multiple instructed items. Researchers should be aware that respondents may fluctuate in effort throughout the survey. Consequently, it is advisable to insert multiple instructed items into a survey (DeSimone et al., 2015).
Warning! The two methods above catch participants who do not read items or options at all, but they do not work against those who deliberately distort their answers. Bogus and instructed items are useful for identifying inattentive participants, but may be less suitable for identifying respondents who intentionally distort their responses (DeSimone, Harms, & DeSimone, 2015).
Method 3: Self-reported diligence
Simple and blunt: just ask participants whether they answered conscientiously.
Self-report indices generally appear in the form of a question (or series of questions) at the end of a survey addressing attention, effort, or thoughtfulness. Although straightforward, a major limitation lies in the transparency of this technique, rendering it vulnerable to dishonesty and demand characteristics (DeSimone et al., 2015).
1. Instruction:“I verify that I have carefully and honestly answered all questions on this survey.” (Meade & Craig, 2012)
2. Self-reported single item (SRSI) indicators. “Lastly, it is vital to our study that we only include responses from people that devoted their full attention to this study. Otherwise years of effort (the researchers’ and the time of other participants) could be wasted. You will receive credit for this study no matter what, however, please tell us how much effort you put forth towards this study.”
“I put forth ____ effort towards this study” with response options of 1 = “almost no,” 2 = “very little,” 3 = “some,” 4 = “quite a bit,” and 5 = “a lot of.” (Meade & Craig, 2012)
3. “Also, often there are several distractions present during studies (other people, TV, music, etc.). Please indicate how much attention you paid to this study. Again, you will receive credit no matter what. We appreciate your honesty!” SRSI Attention was then assessed as the response to the item
“I gave this study ____ attention” with options 1 = “almost no,” 2 = “very little of my,” 3 = “some of my,” 4 = “most of my,” and 5 = “my full.” (Meade & Craig, 2012)
4. Last, we asked, “In your honest opinion, should we use your data in our analyses in this study?” with a 1 = “yes” or 0 = “no” response. (Meade & Craig, 2012)
5. “I occasionally answered items without reading them.” (DeSimone et al., 2015).
6. (e.g., “I carefully considered each item before responding”; Berry et al., 1992; Costa & McCrae, 1997).
Of course, this only works under the assumption that participants answer honestly...
2. Statistical analysis
Method 4: Response time
Using response time as a screening technique relies on the assumption that there is a minimum amount of time that respondents must spend on an item in order to answer accurately. Although variations in reading speed and item length make cutoff scores difficult to justify, it is “unlikely for participants to respond to survey items faster than the rate of 2 s per item” (Huang, Curran, Keeney, Poposki & DeShon, 2012, p. 106).
The response time screen was computed using the average number of seconds required to complete each item. For example, a score of 1.0 indicates that the participant required 1 s/item while a score of 2.5 indicates that the participant required 2.5 s/item (DeSimone et al., 2015).
Huang et al. (2012) suggested screening participants who required less than 2 s/item.
Kurtz and Parish (2001) examined self-report and response time measures and found that using them to screen data is better than doing nothing.
Online survey platforms generally record response time, but this screen is only valid if you have not imposed a forced minimum response time.
Implementation: in SPSS, Mplus, or similar software, simply run descriptive statistics, inspect the minimum response time, and flag respondents who average less than 2 s per item.
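The 2 s/item rule is also trivial to compute from total completion times in a script. A minimal Python sketch (the function names and example durations are my own, not from the cited papers):

```python
def seconds_per_item(total_seconds, n_items):
    """Average response time per item for one respondent."""
    return total_seconds / n_items

def flag_too_fast(durations, n_items, cutoff=2.0):
    """Indices of respondents averaging under `cutoff` seconds per item,
    following the 2 s/item rule of thumb (Huang et al., 2012)."""
    return [i for i, t in enumerate(durations)
            if seconds_per_item(t, n_items) < cutoff]

# Made-up total completion times (in seconds) for a 100-item survey
durations = [95, 310, 140, 480]
print(flag_too_fast(durations, n_items=100))  # [0, 2]
```

Respondents 0 and 2 averaged 0.95 and 1.4 s/item and would be flagged for inspection, not automatically deleted.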
Method 5: Even-odd consistency
In short, the consistency between a person's answers to odd- and even-numbered items: split each unidimensional scale into two halves and compute the within-person correlation.
An additional index recommended by Jackson (1976, as cited in Johnson, 2005) was examined which we termed the Even-Odd Consistency measure. With this approach, unidimensional scales are divided using an even-odd split based on the order of appearance of the items. An even subscale and also an odd subscale score is then computed as the average response across subscale items. A within-person correlation is then computed based on the two sets of subscale scores for each scale. (Meade & Craig, 2012)
The correlation was then corrected for decreased length using the Spearman–Brown formula. Low individual reliability indicated IER (Huang et al., 2012).
Implementation: split each unidimensional scale into odd- and even-numbered items and compute the within-person correlation; respondents with low coefficients may (not necessarily) have answered carelessly.
Method 6: Longstring analysis
Lengthy strings of invariant responses (i.e., the same option being selected repeatedly) may be indicative of low-quality data.
Count a respondent's longest run of consecutive identical answers across items, e.g., 555555 (six "5"s) or 6666666 (seven "6"s).
Response patterns in which respondents consistently respond with the same answer (e.g., “5”) can be identified via an approach recommended by Johnson (2005). This index, termed LongString, is computed as the maximum number of consecutive items on a single page to which the respondent answered with the same response option (Meade & Craig, 2012).
The question is how many consecutive identical answers qualify as careless responding and justify deletion. At present there is no definitive answer:
Given that more extreme responses are less likely, researchers have recommended screens on the basis of 6 to 14 invariant responses in a row depending on which response options are being endorsed (Costa & McCrae, 2008; Huang et al., 2012). The longstring screen is recommended when researchers are administering multidimensional surveys or questionnaires with a mixture of positively and negatively scored items (DeSimone et al., 2015).
There are no established global cut scores in place for it. In keeping with the approach of Huang et al. (2012) regarding a conservative cut score for response time, this paper will suggest a baseline rule of thumb that individuals with a string of consistent responses equal to or greater than half the length of the total scale be considered as C/IE responders by this technique (Curran, 2016).
Personally I side with Curran: a run of identical answers exceeding half the scale length can safely be judged invalid data; anything shorter calls for caution. If you want to be stricter, a cutoff of 8 or 9 can be used; the supporting literature follows.
According to Costa and McCrae (2008), participants who indicate consecutive strings of at least six “strongly disagrees,” nine “disagrees,” ten “neither agree nor disagrees,” fourteen “agrees,” or nine “strongly agrees” should be flagged. Huang et al. (2012) revised these estimates to seven, seven, twelve, ten, and eight, respectively. The cutoff of nine invariant responses was chosen because it reflects the median of Costa and McCrae's (2008) analysis and is close to the mean (8.80) of Huang et al.'s (2012) analysis.
Note also that this method only makes sense when the survey mixes reverse-scored items or positively and negatively framed constructs, such as job burnout and work engagement.
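The longstring index itself is a one-pass scan over a respondent's answers. A minimal sketch (function names and example data are illustrative; the careless R package offers an equivalent):

```python
def longstring(responses):
    """Length of the longest run of consecutive identical answers."""
    if not responses:
        return 0
    longest = run = 1
    for prev, cur in zip(responses, responses[1:]):
        run = run + 1 if cur == prev else 1  # extend or reset the run
        longest = max(longest, run)
    return longest

def flag_longstring(responses):
    """Flag when the longest run reaches half the survey length,
    per the conservative rule of thumb suggested by Curran (2016)."""
    return longstring(responses) >= len(responses) / 2

print(longstring([3, 4, 5, 5, 5, 5, 5, 5, 2, 1]))       # 6
print(flag_longstring([3, 4, 5, 5, 5, 5, 5, 5, 2, 1]))  # True
```

Here six consecutive 5s in a 10-item survey exceed the half-length threshold, so the case is flagged.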
Method 7: Multivariate outlier analysis
The Mahalanobis D statistic (Mahalanobis, 1936) is a multivariate version of outlier analysis that compares a respondent's scores to the sample mean scores across all items. Specifically, the Mahalanobis D is an estimate of the multivariate distance between a respondent's scores on survey items and the sample mean scores on survey items. The underlying assumption of this technique is that extreme deviation from the normative response pattern may be indicative of insufficient effort (DeSimone et al., 2015). Recent evidence suggests that Mahalanobis distance can be effective at identifying inattentive responses (Ehlers et al., 2009).
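A minimal numpy sketch of the computation (the function name and tiny data set are mine; the careless R package computes the same distance). Each respondent's squared distance from the sample centroid can then be compared against a chi-square critical value with df equal to the number of items:

```python
import numpy as np

def mahalanobis_d2(X):
    """Squared Mahalanobis distance of every respondent from the
    sample mean vector. X: respondents x items response matrix."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)                           # center each item
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))   # inverse sample covariance
    return np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Tiny made-up data set: 5 respondents, 2 items
d2 = mahalanobis_d2([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])
# Large d2 values (e.g., beyond the .999 chi-square quantile with
# df = 2, available via scipy.stats.chi2.ppf) would be flagged.
```

Using the pseudoinverse keeps the sketch from crashing when some items are collinear; with real survey data you would inspect the flagged cases rather than delete them outright.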
Method 8: Scale-based screening: social desirability and lie scales
Examples include social desirability scales (e.g., Paulhus, 2002), lie scales (e.g., the MMPI-2 Lie scale), and special scales designed to assess consistent responding (e.g., the MMPI-2 VRIN and TRIN scales) (Meade & Craig, 2012).
Respondents scoring high on these scales deserve a closer look.
Method 9: Semantic synonyms (consistency across semantically similar items)
The semantic synonym technique is designed to identify respondents who indicate dissimilar responses to similar items. For example, “I enjoy my job” may be deemed semantically synonymous with “I like my current occupation.” Alternatively, survey designers may opt to repeat an item (or set of items) later in a survey (DeSimone et al., 2015).
3. Final recommendations
3.1 Meade and Craig's (2012) recommendations
First, we encourage the use of identified responses, but not the harsh wording in the instruction set used here.
Second, for longer surveys, bogus items or instructed-response items (e.g., "Respond with 'strongly agree' for this item") are strongly recommended. We suggest incorporating approximately one such item in every 50–100 items, up to a maximum of three; respondents may become annoyed if a large number appear (Meade & Craig, 2012).
Third, every Internet-based survey should include at least one method for screening careless responding; a simple self-report item plus a look at response time is suggested.
We believe that every Internet-based survey research study would benefit from incorporating at least one careless response detection method. We suggest incorporating a simple self-report measure (i.e., "In your honest opinion, should we use your data?"), coupled with a cursory look at response time for outliers. If only post hoc methods are available, then inspection of response time and computation of the Even-Odd Consistency measure are suggested as minimums (Meade & Craig, 2012).
For stricter data screening, consider adding instructed-response items plus three different statistical indices (e.g., even-odd consistency) (Meade & Craig, 2012).
3.2 Implications of DeSimone et al.'s (2015) recommendations for research design
Some of the survey design principles we learned as undergraduates still hold. For example, the use of both positively and negatively worded items is required for the assessment (and potential dissuasion) of acquiescent responding (Anastasi, 1988; Ray, 1983). Also, the use of longer scales is associated with higher values for coefficient alpha (Cortina, 1993; Cronbach, 1951; Schmitt, 1996) and higher validities (Credé, Harms, Niehorster, & Gaye-Valentine, 2012).
(DeSimone et al., 2015)
3.3 Curran's recommendations on data screening
(Curran, 2016)
3.4 My own recommendations on survey design
At the survey design stage:
1. Control survey length. For 100–150 items, insert 1–2 instructed-response items; too many will disrupt responding and may even annoy participants.
2. At the end, insert one self-report item on diligence.
3. Include both positively and negatively framed variables, such as turnover intention and organizational commitment.
4. When collecting data in companies, gather participants in an office to complete the survey, minimizing extraneous distractions.
5. Record response time (any electronic survey can do this).
6. Provide some compensation (money, or a briefing on the research findings).
7. Apply 1–2 of the statistical screens. The easiest to implement: response time (respondents averaging under 2 seconds per item deserve scrutiny) and longstring (a run of identical consecutive answers longer than half the survey very likely indicates random responding).
8. If you screen the data, compare the results before and after screening and report both in the paper.
4. Implementation in R
All the statistical screens introduced above (response time, longstring, outlier analysis, etc.) can be computed with an R package.
Use R packages: careless
https://cran.r-project.org/web/packages/careless/careless.pdf
Author: Li Peikai
First published by: 荷兰心理统计联盟
PhD candidate in industrial and organizational psychology at Utrecht University; main research interests: leadership, meta-analysis, and appraisals of job characteristics and work stressors.