🔍 Why I Did This
I revisited a university assignment originally completed using SPSS to strengthen my skills in R (statistical programming), data analysis, and UX-focused data visualisation.
My goal was to build a portfolio piece that demonstrates how I can evaluate user experience with validated metrics—tools that have been tested to reliably measure usability and workload.
This was a self-directed project. I reanalysed a dataset where radiologists evaluated two different AI systems used for medical imaging.
🎯 The Goal
I wanted to compare the two AI systems on:
- System Usability – How easy and pleasant the system is to use.
- Task Load – How mentally demanding the system feels.
To measure these, I used two well-established UX questionnaires:
- SUS – System Usability Scale: A short survey with 10 questions rated on a 1–5 scale (higher = better usability).
- NASA-TLX – Task Load Index: A 6-question survey that assesses mental and physical workload. It is originally scored on a 0–100 scale, which I rescaled to 1–5 for easier comparison with the SUS.
🧹 Cleaning & Preparing the Data
Before analysing, I checked and corrected:
- Data types and structure – making sure each variable was in the right format.
- Missing values – ensuring all questions were answered.
- Out-of-range entries – e.g., scores outside expected scales.
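A minimal R sketch of these checks, assuming a data frame called `raw` with item columns named `sus_1`–`sus_10` and `tlx_1`–`tlx_6` (these names are illustrative, not the actual dataset):

```r
str(raw)             # data types and structure
summary(raw)         # quick look at value ranges
colSums(is.na(raw))  # missing values per column

# Count out-of-range entries, e.g. SUS items outside the expected 1–5 range
sus_cols <- paste0("sus_", 1:10)
sapply(raw[sus_cols], function(x) sum(x < 1 | x > 5, na.rm = TRUE))
```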
Key transformations I applied:
| Fix | Why? | Result |
|---|---|---|
| Reverse-coded SUS items (2, 4, 6, 8, 10) | These questions were phrased negatively | All items now consistently reflect that higher scores = better usability |
| Rescaled TLX from 0–100 to 1–5 | TLX scores were ~20× larger than SUS | Scales became easier to compare and visualise side by side |
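As a sketch, the two transformations look like this in R (same assumed `raw` data frame and column names as above):

```r
# Reverse-code the negatively worded SUS items: on a 1–5 scale, reversed = 6 - score
reverse_cols <- paste0("sus_", c(2, 4, 6, 8, 10))
raw[reverse_cols] <- 6 - raw[reverse_cols]

# Rescale NASA-TLX items from 0–100 to 1–5 with a linear mapping (0 -> 1, 100 -> 5)
tlx_cols <- paste0("tlx_", 1:6)
raw[tlx_cols] <- 1 + (raw[tlx_cols] / 100) * 4
```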
📊 What the Data Looked Like
Figure: TLX & SUS item distributions.
🔄 Combining Scores
I averaged the SUS and TLX items for each participant. This produced:
- Usability Score (SUS): overall perception of system ease of use.
- Workload Score (TLX): overall mental/physical effort required.
| AI System | SUS Mean (1–5) | TLX Mean (1–5) |
|---|---|---|
| System 1 | 4.18 | 1.70 |
| System 2 | 4.40 | 1.64 |
Takeaway: Both systems were rated highly usable and low in workload.
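A sketch of how these composites and the means in the table can be produced, assuming the cleaned `raw` data frame from the previous steps plus a `system` column identifying the AI system:

```r
# Average the items into one usability and one workload score per participant
raw$sus_score <- rowMeans(raw[paste0("sus_", 1:10)])
raw$tlx_score <- rowMeans(raw[paste0("tlx_", 1:6)])

# Mean scores per AI system, as summarised in the table above
aggregate(cbind(sus_score, tlx_score) ~ system, data = raw, FUN = mean)
```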
🧪 Testing Reliability
I tested internal consistency to confirm that, within my respondents’ answers, the items of each scale measured the same underlying idea.
If a scale is reliable, its items are consistent in measuring the same thing, and it is valid to average them into a single score.
I tested this with Cronbach’s alpha, where higher values (closer to 1) mean stronger internal consistency.
| Scale | Cronbach’s α | Interpretation |
|---|---|---|
| SUS | 0.84 | ✅ Strong reliability |
| TLX | 0.86 | ✅ Strong reliability |
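For reference, a minimal sketch of this reliability check with the `psych` package (the item-only data frames `sus_items` and `tlx_items` are assumed names, built from the cleaned data above):

```r
library(psych)

# Item-level data only: 10 reverse-coded SUS items, 6 rescaled TLX items
sus_items <- raw[paste0("sus_", 1:10)]
tlx_items <- raw[paste0("tlx_", 1:6)]

psych::alpha(sus_items)$total$raw_alpha  # internal consistency of the SUS items
psych::alpha(tlx_items)$total$raw_alpha  # internal consistency of the TLX items
```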
🔗 Do These Scales Measure Different Things?
I created a correlation matrix showing how each question of the surveys (SUS and TLX) relates to the others.
Interpretation:
- High correlations within each scale (SUS questions correlate strongly with each other).
- Low or negative correlations between SUS and TLX questions.
Figure: Correlation matrix heatmap of SUS and TLX items.
Takeaway:
This supports divergent validity—meaning usability and workload are truly different experiences.
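A sketch of how a heatmap like the one above can be generated in base R, reusing the assumed item data frames from earlier (a package such as `corrplot` or `ggplot2` would work equally well):

```r
# Correlation matrix across all SUS and TLX items
items <- cbind(sus_items, tlx_items)
cor_mat <- cor(items, use = "pairwise.complete.obs")

# Simple heatmap of the raw correlations (no clustering, no rescaling)
heatmap(cor_mat, Rowv = NA, Colv = NA, symm = TRUE, scale = "none")
```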
📉 Do the Scores Follow a Normal Distribution?
Many statistical tests assume data is normally distributed (bell-shaped curve).
I checked this using Shapiro-Wilk tests, which assess whether scores differ significantly from a normal distribution.
- All scores were non-normal (p < 0.001).
- Therefore, I used non-parametric tests (Wilcoxon) that don’t require normality.
Figure: Distribution of SUS and TLX scores by AI system.
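A sketch of the normality check, using the per-participant scores and `system` column assumed earlier:

```r
# Shapiro-Wilk test for each score within each AI system; p < 0.05 suggests non-normality
tapply(raw$sus_score, raw$system, shapiro.test)
tapply(raw$tlx_score, raw$system, shapiro.test)
```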
⚖️ Statistical Comparison Between Systems
| Metric | p-value | Conclusion |
|---|---|---|
| TLX Score | 0.353 | ❌ No significant difference in workload |
| SUS Score | 0.008 | ✅ Significant difference in usability |
Takeaway:
- Only usability differed significantly: System 2 was rated more usable.
- A p-value < 0.05 means the difference is statistically significant.
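A sketch of the comparison with Wilcoxon tests; the exact call depends on whether the same radiologists rated both systems, so the unpaired form below is an assumption:

```r
# Compare the two AI systems on usability and workload
wilcox.test(sus_score ~ system, data = raw)  # usability comparison
wilcox.test(tlx_score ~ system, data = raw)  # workload comparison

# If each radiologist rated both systems, a paired test would be appropriate instead:
# wilcox.test(sus_system1, sus_system2, paired = TRUE)
```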
🧠 Summary of UX Insights
✅ System 2 was preferred for usability. Users found it easier to use and felt more confident.
⚖️ Workload was similar between systems. Neither imposed excessive mental or physical demands.
📘 What This Shows About My UX Analysis Skills
This project demonstrates:
- 📊 Quantitative UX research—using surveys to measure user experience.
- 🧹 Data cleaning and preparation—ensuring results are accurate and comparable.
- 🧠 Psychometric analysis—checking reliability and validity of measures.
- 🖼 Visualisation—creating clear charts for both technical and non-technical audiences.
- 🧭 Actionable UX insights—translating numbers into findings that inform design decisions.
💬 Interested in the Details?
I’m happy to walk you through the R code, share the data preparation steps, or discuss how this approach supports evidence-based UX design.