Awesome-Agent-Skills-for-Empirical-Research questionnaire-design-guide

Questionnaire and survey design with Likert scales and coding

install

source · Clone the upstream repo:

```shell
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
```

Claude Code · Install into ~/.claude/skills/:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/analysis/wrangling/questionnaire-design-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-questionnaire-des && rm -rf "$T"
```

manifest: skills/43-wentorai-research-plugins/skills/analysis/wrangling/questionnaire-design-guide/SKILL.md
source content

Questionnaire Design Guide

Design valid and reliable survey instruments with proper question types, Likert scale construction, response coding, and data preparation for analysis.

Survey Design Principles

Question Types

| Type | Example | Best For | Analysis |
|------|---------|----------|----------|
| Likert scale | "Rate your agreement: 1-5" | Attitudes, perceptions | Ordinal/interval statistics |
| Multiple choice | "Select your field" | Demographics, categories | Frequencies, chi-square |
| Ranking | "Rank these 5 options" | Preferences, priorities | Rank correlations |
| Open-ended | "Describe your experience" | Exploratory, rich data | Qualitative coding |
| Matrix/grid | Multiple items, same scale | Efficient battery of items | Factor analysis, reliability |
| Slider/VAS | 0-100 visual analog scale | Continuous measures | Parametric statistics |
| Semantic differential | "Easy __ __ __ __ __ Difficult" | Bipolar attitudes | Factor analysis |

The Four C's of Good Questions

  1. Clear: Avoid jargon, double-barreled questions, and ambiguity
  2. Concise: Keep questions short (ideally under 20 words)
  3. Complete: Include all relevant response options
  4. Consistent: Use the same scale direction and format throughout

Likert Scale Design

Scale Points

| Points | Scale Example | Recommended Use |
|--------|---------------|-----------------|
| 4-point | Strongly Disagree to Strongly Agree | Forces a choice (no neutral); less discriminating |
| 5-point | SD, D, Neutral, A, SA | Most common; good balance of simplicity and discrimination |
| 7-point | SD, D, Somewhat D, Neutral, Somewhat A, A, SA | More discriminating; better for experienced respondents |
| 11-point (0-10) | Not at all to Completely | NPS, continuous-like measures |

Anchoring Labels

5-Point Agreement Scale:
1 = Strongly Disagree
2 = Disagree
3 = Neither Agree nor Disagree
4 = Agree
5 = Strongly Agree

5-Point Frequency Scale:
1 = Never
2 = Rarely
3 = Sometimes
4 = Often
5 = Always

5-Point Satisfaction Scale:
1 = Very Dissatisfied
2 = Dissatisfied
3 = Neutral
4 = Satisfied
5 = Very Satisfied

Reverse-Coded Items

Include 2-3 reverse-coded items per construct to detect acquiescence bias:

```
Regular:  "I find research methods interesting."        (1-5: SD to SA)
Reversed: "I find research methods tedious and dull."   (1-5: SD to SA)

# Recode reversed items before analysis:
# reversed_score = (max_scale + 1) - raw_score
# For a 5-point scale: reversed_score = 6 - raw_score
```
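
The recode formula can be checked in one line; a quick sketch:

```python
# Reverse-coding on a 5-point scale maps 1<->5, 2<->4, and leaves 3 fixed
raw_scores = [1, 2, 3, 4, 5]
reversed_scores = [6 - x for x in raw_scores]
print(reversed_scores)  # [5, 4, 3, 2, 1]
```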

Constructing a Multi-Item Scale

Step-by-Step Process

  1. Define the construct: Write a clear conceptual definition
  2. Generate items: Write 1.5-2x the number of items you plan to keep (e.g., write 15 items for an 8-item scale)
  3. Expert review: Have 3-5 experts rate each item for relevance (Content Validity Index)
  4. Pilot test: Administer to 30-50 respondents
  5. Item analysis: Calculate item-total correlations, check reliability
  6. Exploratory Factor Analysis (EFA): Confirm dimensionality
  7. Finalize scale: Remove weak items, re-test reliability
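
Step 3's Content Validity Index can be computed directly from an expert rating matrix. A minimal sketch with a hypothetical ratings matrix (5 experts rating 4 items on a 1-4 relevance scale): I-CVI is the proportion of experts rating an item 3 or 4, and S-CVI/Ave is the mean of the item-level I-CVIs.

```python
import numpy as np

# Hypothetical relevance ratings: 5 experts (rows) x 4 items (cols), 1-4 scale
ratings = np.array([
    [4, 3, 2, 4],
    [3, 4, 2, 4],
    [4, 4, 3, 3],
    [4, 3, 1, 4],
    [3, 4, 2, 4],
])

# I-CVI: proportion of experts rating the item as relevant (3 or 4)
i_cvi = (ratings >= 3).mean(axis=0)
# S-CVI/Ave: mean of the item-level I-CVIs
s_cvi_ave = i_cvi.mean()

print(i_cvi)  # the third item falls well below the 0.78 cutoff
print(round(s_cvi_ave, 2))
```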

Example: Research Self-Efficacy Scale

Construct: Belief in one's ability to conduct academic research

Items (5-point Likert, Strongly Disagree to Strongly Agree):
RSE1: I can formulate clear research questions.
RSE2: I can design an appropriate research methodology.
RSE3: I can analyze data using statistical software.
RSE4: I can write a publishable research paper.
RSE5: I can critically evaluate published research.
RSE6: I can present research findings at a conference.
RSE7R: I struggle to interpret statistical results. [REVERSED]
RSE8R: I find it difficult to synthesize literature. [REVERSED]

Data Coding and Preparation

Coding Scheme

```python
import pandas as pd
import numpy as np

# df: the raw survey export, already loaded as a DataFrame

# Define coding scheme
likert_coding = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neither Agree nor Disagree": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}

# Apply coding
df["Q1_coded"] = df["Q1_raw"].map(likert_coding)

# Reverse code specific items
reverse_items = ["RSE7R", "RSE8R"]
max_scale = 5
for item in reverse_items:
    df[f"{item}_recoded"] = (max_scale + 1) - df[item]

# Calculate composite score (mean of items)
scale_items = ["RSE1", "RSE2", "RSE3", "RSE4", "RSE5", "RSE6",
               "RSE7R_recoded", "RSE8R_recoded"]
df["RSE_mean"] = df[scale_items].mean(axis=1)
```
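
The pipeline above assumes `df` is already loaded from your survey export; the same steps run end-to-end on a tiny hypothetical two-respondent frame (column names are illustrative):

```python
import pandas as pd

likert_coding = {"Strongly Disagree": 1, "Disagree": 2,
                 "Neither Agree nor Disagree": 3, "Agree": 4,
                 "Strongly Agree": 5}

demo = pd.DataFrame({
    "Q1_raw": ["Agree", "Strongly Disagree"],
    "RSE7R": [2, 5],  # reverse-coded item, raw 1-5
})
demo["Q1_coded"] = demo["Q1_raw"].map(likert_coding)
demo["RSE7R_recoded"] = 6 - demo["RSE7R"]  # 5-point reverse code
print(demo[["Q1_coded", "RSE7R_recoded"]])
```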

Missing Data Handling

```python
# Check missing data patterns
print(df[scale_items].isnull().sum())
print(f"Complete cases: {df[scale_items].dropna().shape[0]} / {df.shape[0]}")

# Common strategies (pick ONE; shown together for reference):

# 1. Listwise deletion (if < 5% missing)
df_complete = df.dropna(subset=scale_items)

# 2. Mean imputation per item (simple but biased; shrinks variance)
df[scale_items] = df[scale_items].fillna(df[scale_items].mean())

# 3. Person-mean imputation (if < 20% of items missing per person)
def person_mean_impute(row, items, max_missing=2):
    if row[items].isnull().sum() <= max_missing:
        return row[items].fillna(row[items].mean())
    return row[items]  # leave as NaN if too many items are missing

df[scale_items] = df.apply(lambda r: person_mean_impute(r, scale_items), axis=1)
```

Reliability Analysis

Cronbach's Alpha

Python (pingouin):

```python
import pingouin as pg

# Calculate Cronbach's alpha (returns (alpha, 95% CI))
alpha = pg.cronbach_alpha(df[scale_items])
print(f"Cronbach's alpha: {alpha[0]:.3f}")
# Interpretation: >= 0.70 acceptable, >= 0.80 good, >= 0.90 excellent
```

R (psych):

```r
library(psych)

# Cronbach's alpha with item-level diagnostics
alpha_result <- alpha(data[, scale_items])
print(alpha_result)
# Check "raw_alpha" in the "if an item is dropped" table to identify weak items
```
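
For intuition, Cronbach's alpha is just a function of the item variances and the total-score variance: alpha = k/(k-1) * (1 - sum(item variances) / variance of the sum score). A from-scratch sketch on simulated data (not the RSE items):

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an (n_respondents, k_items) array."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)      # per-item variance
    total_var = X.sum(axis=1).var(ddof=1)  # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Four items sharing one latent factor -> internally consistent scale
rng = np.random.default_rng(1)
common = rng.normal(size=(500, 1))
X = common + rng.normal(scale=0.7, size=(500, 4))
print(f"alpha = {cronbach_alpha(X):.3f}")
```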

Item-Total Correlations

# Corrected item-total correlations (should be > 0.30)
item_stats <- alpha_result$item.stats
print(item_stats[, c("r.drop", "raw.alpha")])
# r.drop < 0.30: consider removing the item
# raw.alpha increases if dropped: item is weakening the scale
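
The same r.drop diagnostic in Python, assuming the items are already reverse-coded; the synthetic frame here is illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
common = rng.normal(size=300)  # shared latent trait
items = pd.DataFrame({
    f"item{i}": np.clip(np.round(3 + common + rng.normal(scale=0.8, size=300)), 1, 5)
    for i in range(1, 5)
})

# Corrected item-total correlation: each item vs. the total of the OTHER items
total = items.sum(axis=1)
r_drop = {col: items[col].corr(total - items[col]) for col in items.columns}
for col, r in r_drop.items():
    print(f"{col}: r.drop = {r:.3f}" + ("" if r > 0.30 else "  <- weak item"))
```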

Validity Assessment

| Validity Type | Method | Criterion |
|---------------|--------|-----------|
| Content validity | Expert panel rating (CVI) | I-CVI >= 0.78, S-CVI/Ave >= 0.90 |
| Construct validity | Exploratory Factor Analysis (EFA) | Eigenvalue > 1, loadings > 0.40 |
| Convergent validity | Correlation with related construct | r > 0.30 |
| Discriminant validity | Correlation with unrelated construct | r < 0.30 |
| Criterion validity | Correlation with external criterion | Significant correlation |
| Test-retest reliability | ICC or Pearson r over 2-4 weeks | ICC > 0.70 |
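
The "Eigenvalue > 1" rule in the table (the Kaiser criterion) can be checked before a full EFA by inspecting the eigenvalues of the item correlation matrix. A sketch on simulated two-factor data (the loadings and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
# 200 respondents, 6 items driven by two latent factors (3 items each)
latent = rng.normal(size=(200, 2))
loadings = np.array([[0.8, 0.0], [0.7, 0.0], [0.75, 0.0],
                     [0.0, 0.8], [0.0, 0.7], [0.0, 0.75]])
items = latent @ loadings.T + rng.normal(scale=0.5, size=(200, 6))

# Kaiser criterion: retain factors whose eigenvalue exceeds 1
corr = np.corrcoef(items, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
n_factors = int((eigenvalues > 1).sum())
print("eigenvalues:", eigenvalues.round(2))
print("factors to retain (Kaiser):", n_factors)
```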

Common Design Mistakes

| Mistake | Example | Fix |
|---------|---------|-----|
| Double-barreled question | "This course is interesting and useful" | Split into two separate items |
| Leading question | "Don't you agree that X is important?" | "How important is X to you?" |
| Absolute terms | "Do you always check citations?" | "How often do you check citations?" |
| Missing option | No "Not Applicable" when needed | Add an N/A option or filter logic |
| Inconsistent scale direction | Some items 1=good, others 1=bad | Standardize direction; clearly mark reversed items |
| Too many items | 100-item survey | Aim for 5-8 items per construct, 15-30 min total |
| No pilot test | Skipping straight to full deployment | Always pilot with 30-50 respondents |

Survey Platform Comparison

| Platform | Cost | Features | Best For |
|----------|------|----------|----------|
| Qualtrics | Institutional | Advanced logic, panels, API | Large academic studies |
| SurveyMonkey | Freemium | Easy to use, basic analysis | Quick surveys |
| Google Forms | Free | Simple, integrates with Sheets | Classroom, pilot testing |
| LimeSurvey | Free / self-hosted | Open source, full control | Privacy-sensitive research |
| REDCap | Free (academic) | Clinical data, HIPAA compliant | Medical/clinical research |
| Prolific | Per-response | Participant recruitment | Online experiments |