Lindsay Plat
    Quantitative UX: Analysing Usability & Workload of Medical AI


    🔍 Why I Did This

    I revisited a university assignment originally completed using SPSS to strengthen my skills in R (statistical programming), data analysis, and UX-focused data visualisation.

    My goal was to build a portfolio piece that demonstrates how I can evaluate user experience with validated metrics—tools that have been tested to reliably measure usability and workload.

    This was a self-directed project. I reanalysed a dataset where radiologists evaluated two different AI systems used for medical imaging.

    🎯 The Goal

    I wanted to compare the two AI systems on:

    • System Usability – How easy and pleasant the system is to use.
    • Task Load – How mentally demanding the system feels.

    To measure these, I used two well-established UX questionnaires:

    • SUS – System Usability Scale: A short survey with 10 questions rated on a 1–5 scale (higher = better usability).
    • NASA-TLX – Task Load Index: A 6-item survey that assesses mental and physical workload. Its items are originally scored on a 0–100 scale; I rescaled them to 1–5 for easier comparison with the SUS.

    🧹 Cleaning & Preparing the Data

    Before analysing, I checked and corrected:

    • Data types and structure – making sure each variable was in the right format.
    • Missing values – ensuring all questions were answered.
    • Out-of-range entries – e.g., scores outside expected scales.
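    A minimal sketch of these checks in R, assuming the responses sit in a data frame `df` with item columns named `SUS1`–`SUS10` and `TLX1`–`TLX6` (hypothetical names; the real dataset may differ):

    ```r
    # Inspect structure and data types of every variable
    str(df)

    # Count missing values per column
    colSums(is.na(df))

    # Flag out-of-range entries: SUS items should be 1–5, raw TLX items 0–100
    sus_items <- paste0("SUS", 1:10)
    tlx_items <- paste0("TLX", 1:6)
    sapply(df[sus_items], function(x) sum(x < 1 | x > 5, na.rm = TRUE))
    sapply(df[tlx_items], function(x) sum(x < 0 | x > 100, na.rm = TRUE))
    ```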

    Key transformations I applied:

    Fix | Why? | Result
    Reverse-coded SUS items (2, 4, 6, 8, 10) | These questions are phrased negatively | All items now consistently reflect higher scores = better usability
    Rescaled TLX from 0–100 to 1–5 | Raw TLX scores were ~20× larger than SUS scores | The two scales can be compared and visualised side by side
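    Both fixes are one-liners in R; a sketch using the same hypothetical column names as above:

    ```r
    # Reverse-code the negatively phrased SUS items: on a 1–5 scale, 6 - x flips the direction
    rev_items <- paste0("SUS", c(2, 4, 6, 8, 10))
    df[rev_items] <- 6 - df[rev_items]

    # Rescale TLX items from 0–100 to 1–5 (0 maps to 1, 100 maps to 5)
    tlx_items <- paste0("TLX", 1:6)
    df[tlx_items] <- 1 + 4 * df[tlx_items] / 100
    ```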

    📊 What the Data Looked Like

    TLX & SUS Item Distribution

    Each box shows how people scored each question. SUS items (left) clustered around high scores (4–5), suggesting users generally found the systems easy to use. TLX items (right) had lower scores (1–2), indicating low perceived workload. Red dots are outliers—occasional users who reported higher workload.
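    A sketch of how a plot like this can be built with ggplot2 and tidyr (assumed packages; the original plot may have been drawn differently), reshaping the items to long format first:

    ```r
    library(ggplot2)
    library(tidyr)

    # One row per participant × item, with a column marking which scale the item belongs to
    long <- pivot_longer(df, cols = matches("^(SUS|TLX)"),
                         names_to = "item", values_to = "score")
    long$scale <- ifelse(grepl("^SUS", long$item), "SUS", "TLX")

    # Boxplot per item, outliers in red, one panel per scale
    ggplot(long, aes(x = item, y = score)) +
      geom_boxplot(outlier.colour = "red") +
      facet_wrap(~ scale, scales = "free_x") +
      labs(x = NULL, y = "Score (1–5)")
    ```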

    🔄 Combining Scores

    I averaged the SUS and TLX items for each participant. This produced:

    • Usability Score (SUS): overall perception of system ease of use.
    • Workload Score (TLX): overall mental/physical effort required.
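    Computing the composites is a single `rowMeans` call per scale; a sketch, again with the hypothetical item names plus an assumed `system` column identifying the AI system:

    ```r
    sus_items <- paste0("SUS", 1:10)
    tlx_items <- paste0("TLX", 1:6)

    # One usability and one workload score per participant
    df$sus_score <- rowMeans(df[sus_items], na.rm = TRUE)
    df$tlx_score <- rowMeans(df[tlx_items], na.rm = TRUE)

    # Mean composite scores per AI system
    aggregate(cbind(sus_score, tlx_score) ~ system, data = df, FUN = mean)
    ```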
    AI System | SUS Mean (1–5) | TLX Mean (1–5)
    System 1 | 4.18 | 1.70
    System 2 | 4.40 | 1.64
    💡 Takeaway: Both systems were rated highly usable and low in workload.

    🧪 Testing Reliability

    I tested internal consistency to confirm that, within each scale, all the items measure the same underlying idea; if they do, it is valid to average them into a single score.

    I checked this with Cronbach’s alpha, where values closer to 1 indicate stronger internal consistency.

    Scale | Cronbach’s α | Interpretation
    SUS | 0.84 | ✅ Strong reliability
    TLX | 0.86 | ✅ Strong reliability
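    One way to compute this in R is with the psych package (an assumed choice among several options), run separately on each scale’s items:

    ```r
    library(psych)

    sus_items <- paste0("SUS", 1:10)
    tlx_items <- paste0("TLX", 1:6)

    # Cronbach's alpha per scale; the 'raw_alpha' value is the headline figure
    psych::alpha(df[sus_items])
    psych::alpha(df[tlx_items])
    ```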

    🔗 Do These Scales Measure Different Things?

    I created a correlation matrix showing how each question from the two surveys (SUS and TLX) relates to every other question.

    Interpretation:

    • High correlations within each scale (SUS questions correlate strongly with each other).
    • Low or negative correlations between SUS and TLX questions.
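    A sketch of computing and plotting such a matrix, here with the corrplot package (one of several plotting options, and an assumption on my part):

    ```r
    library(corrplot)

    sus_items <- paste0("SUS", 1:10)
    tlx_items <- paste0("TLX", 1:6)

    # Pairwise correlations between all 16 items
    cormat <- cor(df[c(sus_items, tlx_items)], use = "pairwise.complete.obs")

    # Heatmap of the matrix; darker cells = stronger correlation
    corrplot(cormat, method = "color", type = "lower", tl.cex = 0.7)
    ```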

    Correlation Matrix Heatmap of SUS and TLX items

    Heatmap: Darker colours = stronger correlation
    💡 Takeaway: This supports divergent validity, meaning usability and workload are truly different experiences.

    📉 Do the Scores Follow a Normal Distribution?

    Many statistical tests assume data is normally distributed (bell-shaped curve).

    I checked this using Shapiro–Wilk tests, which assess whether scores deviate significantly from a normal distribution.

    • All scores were non-normal (p < 0.001).
    • Therefore, I used non-parametric tests (Wilcoxon) that don’t require normality.
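    A sketch of both steps in base R, using the composite scores and `system` column from above. I’ve written the Wilcoxon test as an unpaired rank-sum comparison, which assumes different radiologists rated each system; with repeated measures, `paired = TRUE` would be needed instead:

    ```r
    # Normality check: p < 0.05 means scores deviate from a normal distribution
    shapiro.test(df$sus_score)
    shapiro.test(df$tlx_score)

    # Non-parametric comparison of the two AI systems on each metric
    wilcox.test(sus_score ~ system, data = df)
    wilcox.test(tlx_score ~ system, data = df)
    ```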

    Distribution of SUS and TLX Scores by AI System

    Distributions show System Usability Scale (SUS) scores and rescaled NASA-TLX scores (normalised to a 1–5 range for comparability) by AI system. Higher SUS indicates greater perceived usability; higher TLX indicates greater perceived workload.

    ⚖️ Statistical Comparison Between Systems

    Metric | p-value | Conclusion
    TLX Score | 0.353 | ❌ No significant difference in workload
    SUS Score | 0.008 | ✅ Significant difference in usability
    💡 Takeaway:

    • Only usability differed significantly: System 2 was rated more usable.
    • A p-value below 0.05 is conventionally treated as statistically significant, meaning the difference is unlikely to be due to chance alone.

    🧠 Summary of UX Insights

    ✅ System 2 was preferred for usability. Users found it easier to use and felt more confident.

    ⚖️ Workload was similar between systems. Neither imposed excessive mental or physical demands.

    📘 What This Shows About My UX Analysis Skills

    This project demonstrates:

    • 📊 Quantitative UX research—using surveys to measure user experience.
    • 🧹 Data cleaning and preparation—ensuring results are accurate and comparable.
    • 🧠 Psychometric analysis—checking reliability and validity of measures.
    • 🖼 Visualisation—creating clear charts for both technical and non-technical audiences.
    • 🧭 Actionable UX insights—translating numbers into findings that inform design decisions.

    💬 Interested in the Details?

    I’m happy to walk you through the R code, share the data preparation steps, or discuss how this approach supports evidence-based UX design.

    Lindsay Plat
    LinkedIn