🔍 Why I Did This
I revisited a university assignment originally completed using SPSS to strengthen my skills in R (statistical programming), data analysis, and UX-focused data visualisation.
My goal was to build a portfolio piece that demonstrates how I can evaluate user experience with validated metrics—tools that have been tested to reliably measure usability and workload.
This was a self-directed project. I reanalysed a dataset where radiologists evaluated two different AI systems used for medical imaging.
🎯 The Goal
I wanted to compare the two AI systems on:
- System Usability – How easy and pleasant the system is to use.
- Task Load – How mentally demanding the system feels.
To measure these, I used two well-established UX questionnaires:
- SUS – System Usability Scale: A short survey with 10 questions rated on a 1–5 scale (higher = better usability).
- NASA-TLX – Task Load Index: A 6-question survey that assesses mental and physical workload. It is originally scored on a 0–100 scale, which I rescaled to 1–5 for easier comparison with the SUS.
🧹 Cleaning & Preparing the Data
Before analysing, I checked and corrected:
- Data types and structure – making sure each variable was in the right format.
- Missing values – ensuring all questions were answered.
- Out-of-range entries – e.g., scores outside expected scales.
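A minimal R sketch of these checks, assuming a data frame called `raw` with item columns named `sus_1`–`sus_10` and `tlx_1`–`tlx_6` (these names are illustrative, not the actual dataset):

```r
str(raw)             # data types and structure
summary(raw)         # quick look at value ranges
colSums(is.na(raw))  # missing values per column

# Count out-of-range entries, e.g. SUS items outside the expected 1–5 range
sus_cols <- paste0("sus_", 1:10)
sapply(raw[sus_cols], function(x) sum(x < 1 | x > 5, na.rm = TRUE))
```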
Key transformations I applied:
| Fix | Why? | Result |
|---|---|---|
| Reverse-coded SUS items (2, 4, 6, 8, 10) | These questions were phrased negatively | All items now consistently reflect that higher scores = better usability |
| Rescaled TLX from 0–100 to 1–5 | TLX scores were ~20× larger than SUS | Scales became easier to compare and visualise side by side |
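As a sketch, the two transformations look like this in R (same assumed `raw` data frame and column names as above):

```r
# Reverse-code the negatively worded SUS items: on a 1–5 scale, reversed = 6 - score
reverse_cols <- paste0("sus_", c(2, 4, 6, 8, 10))
raw[reverse_cols] <- 6 - raw[reverse_cols]

# Rescale NASA-TLX items from 0–100 to 1–5 with a linear mapping (0 -> 1, 100 -> 5)
tlx_cols <- paste0("tlx_", 1:6)
raw[tlx_cols] <- 1 + (raw[tlx_cols] / 100) * 4
```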
📊 What the Data Looked Like
Figure: TLX & SUS item distributions.
🔄 Combining Scores
I averaged the SUS and TLX items for each participant. This produced:
- Usability Score (SUS): overall perception of system ease of use.
- Workload Score (TLX): overall mental/physical effort required.
| AI System | SUS Mean (1–5) | TLX Mean (1–5) |
|---|---|---|
| System 1 | 4.18 | 1.70 |
| System 2 | 4.40 | 1.64 |
Takeaway: Both systems were rated highly usable and low in workload.
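A sketch of how these composites and the means in the table can be produced, assuming the cleaned `raw` data frame from the previous steps plus a `system` column identifying the AI system:

```r
# Average the items into one usability and one workload score per participant
raw$sus_score <- rowMeans(raw[paste0("sus_", 1:10)])
raw$tlx_score <- rowMeans(raw[paste0("tlx_", 1:6)])

# Mean scores per AI system, as summarised in the table above
aggregate(cbind(sus_score, tlx_score) ~ system, data = raw, FUN = mean)
```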
🧪 Testing Reliability
I tested internal consistency to confirm that, within my respondents’ answers, the items of each scale measured the same underlying idea.
If a scale is reliable, its items are consistent in measuring the same thing, and it is valid to average them into a single score.
I tested this with Cronbach’s alpha, where higher values (closer to 1) mean stronger internal consistency.
| Scale | Cronbach’s α | Interpretation |
|---|---|---|
| SUS | 0.84 | ✅ Strong reliability |
| TLX | 0.86 | ✅ Strong reliability |
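For reference, a minimal sketch of this reliability check with the `psych` package (the item-only data frames `sus_items` and `tlx_items` are assumed names, built from the cleaned data above):

```r
library(psych)

# Item-level data only: 10 reverse-coded SUS items, 6 rescaled TLX items
sus_items <- raw[paste0("sus_", 1:10)]
tlx_items <- raw[paste0("tlx_", 1:6)]

psych::alpha(sus_items)$total$raw_alpha  # internal consistency of the SUS items
psych::alpha(tlx_items)$total$raw_alpha  # internal consistency of the TLX items
```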
🔗 Do These Scales Measure Different Things?
I created a correlation matrix showing how each question of the surveys (SUS and TLX) relates to the others.
Interpretation:
- High correlations within each scale (SUS questions correlate strongly with each other).
- Low or negative correlations between SUS and TLX questions.
Figure: Correlation matrix heatmap of SUS and TLX items.
Takeaway:
This supports divergent validity—meaning usability and workload are truly different experiences.
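A sketch of how a heatmap like the one above can be generated in base R, reusing the assumed item data frames from earlier (a package such as `corrplot` or `ggplot2` would work equally well):

```r
# Correlation matrix across all SUS and TLX items
items <- cbind(sus_items, tlx_items)
cor_mat <- cor(items, use = "pairwise.complete.obs")

# Simple heatmap of the raw correlations (no clustering, no rescaling)
heatmap(cor_mat, Rowv = NA, Colv = NA, symm = TRUE, scale = "none")
```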
📉 Do the Scores Follow a Normal Distribution?
Many statistical tests assume data is normally distributed (bell-shaped curve).
I checked this using Shapiro-Wilk tests, which assess whether scores differ significantly from a normal distribution.
- All scores were non-normal (p < 0.001).
- Therefore, I used non-parametric tests (Wilcoxon) that don’t require normality.
Figure: Distribution of SUS and TLX scores by AI system.
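A sketch of the normality check, using the per-participant scores and `system` column assumed earlier:

```r
# Shapiro-Wilk test for each score within each AI system; p < 0.05 suggests non-normality
tapply(raw$sus_score, raw$system, shapiro.test)
tapply(raw$tlx_score, raw$system, shapiro.test)
```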
⚖️ Statistical Comparison Between Systems
| Metric | p-value | Conclusion |
|---|---|---|
| TLX Score | 0.353 | ❌ No significant difference in workload |
| SUS Score | 0.008 | ✅ Significant difference in usability |
Takeaway:
- Only usability differed significantly: System 2 was rated more usable.
- A p-value < 0.05 means the difference is statistically significant.
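A sketch of the comparison with Wilcoxon tests; the exact call depends on whether the same radiologists rated both systems, so the unpaired form below is an assumption:

```r
# Compare the two AI systems on usability and workload
wilcox.test(sus_score ~ system, data = raw)  # usability comparison
wilcox.test(tlx_score ~ system, data = raw)  # workload comparison

# If each radiologist rated both systems, a paired test would be appropriate instead:
# wilcox.test(sus_system1, sus_system2, paired = TRUE)
```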
🧠 Summary of UX Insights
✅ System 2 was preferred for usability. Users found it easier to use and felt more confident.
⚖️ Workload was similar between systems. Neither imposed excessive mental or physical demands.
📘 What This Shows About My UX Analysis Skills
This project demonstrates:
- 📊 Quantitative UX research—using surveys to measure user experience.
- 🧹 Data cleaning and preparation—ensuring results are accurate and comparable.
- 🧠 Psychometric analysis—checking reliability and validity of measures.
- 🖼 Visualisation—creating clear charts for both technical and non-technical audiences.
- 🧭 Actionable UX insights—translating numbers into findings that inform design decisions.
💬 Interested in the Details?
I’m happy to walk you through the R code, share the data preparation steps, or discuss how this approach supports evidence-based UX design.