How It Works
A look inside the scoring engine, from the intuition to the math.
The Big Picture
Your transcript score measures the intellectual breadth of your course selection. It answers the question: how much of the academic landscape have you explored?
We start by converting every course at Illinois into a mathematical representation based on its description. Courses with similar content end up close together; courses covering unrelated topics end up far apart. Think of each course as a point in a vast high-dimensional conceptual space.
Your score reflects the volume that your courses span in this space. A student who takes only CS courses occupies a small corner — those courses cluster tightly together. A student who mixes computer science, music, psychology, and chemistry spans a much larger region.
To make the score intuitive, we normalize it to a 0–100 scale. We compare your volume against two reference points:
- The maximum — the most diverse transcript possible at your course count, found by a greedy algorithm that picks the most different course at each step.
- The minimum — a realistic floor representing a student who stays entirely within one department.
A score of 50 means you're halfway between a single-department specialist and the theoretical maximum. The Surprising Pairs section highlights your most unrelated course pairs, while Clusters shows groups of courses that are similar to each other.
Try It
The widget below uses real course embeddings. Toggle courses on and off to see how the Gram matrix and volume change. Lower off-diagonal values mean courses are more different, which increases the volume.
The three example courses and their descriptions:
- CS: A first course in computer science for majors. Fundamentals of computing, problem solving, and the Java programming language. Introduction to abstraction, object-oriented programming, data structures, and algorithms.
- MATH: Third course in calculus and analytic geometry, including vector analysis. Topics include vectors in two and three dimensions, partial derivatives, multiple integrals, line integrals, surface integrals, and Stokes' theorem.
- MUS: A survey of Western art music from the Middle Ages to the present. Students develop active listening skills and study the structural, historical, and cultural contexts of representative works by major composers.
The embedding model reads the full description and captures semantic meaning, not just word overlap. Courses that share boilerplate (“introduction to X”) but cover different subjects still end up far apart. The tight clusters you see come from genuine topical overlap.
Gram matrix G — each cell is the dot product between two course vectors, divided by k=3
| | CS | MATH | MUS |
|---|---|---|---|
| CS | 0.333 | 0.145 | 0.138 |
| MATH | 0.145 | 0.333 | 0.106 |
| MUS | 0.138 | 0.106 | 0.333 |
The amber diagonal is each course’s self-similarity — cosine similarity is 1, divided by k, so every diagonal equals 1/k. The off-diagonal cells are what matter for diversity: their colors reflect the raw cosine similarity between course pairs (k-independent), while the displayed numbers are the Gram entries (divided by k), which is why the numbers change as you add or remove courses but the colors stay fixed.
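As a rough check of the table above, the sketch below multiplies the Gram entries by k to recover the raw cosine similarities, then recomputes the entries for a two-course transcript. The numbers are copied from the table; the variable names are illustrative and not taken from the scoring engine.

```python
import numpy as np

# Gram entries from the table above (k = 3 courses: CS, MATH, MUS).
G3 = np.array([
    [0.333, 0.145, 0.138],
    [0.145, 0.333, 0.106],
    [0.138, 0.106, 0.333],
])
k = 3

# Multiplying by k recovers the k-independent cosine similarities (the cell colors).
cosine = G3 * k

# Toggling MUS off leaves k = 2, so the same cosines are now divided by 2.
keep = [0, 1]  # CS and MATH
G2 = cosine[np.ix_(keep, keep)] / len(keep)

# Log-determinant (the volume measure) for both transcripts.
logdet3 = np.linalg.slogdet(G3)[1]
logdet2 = np.linalg.slogdet(G2)[1]

print(np.round(cosine, 3))
print(np.round(G2, 4))
print(logdet3, logdet2)
```

The CS/MATH cell moves from 0.145 to about 0.218 when MUS is removed, even though the underlying cosine similarity (about 0.435) is unchanged.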
Technical Details
Course Embeddings
Each course is embedded using OpenAI's text-embedding-3-large model, which maps the course title and description to a 3,072-dimensional unit vector. Embeddings are generated per-semester to capture evolving course content, with a fallback to the most recent embedding when a specific semester isn't available. Vectors are stored as float16 to halve database size with negligible accuracy impact (less than 0.02%).
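To give a feel for the float16 trade-off, here is a minimal sketch that round-trips vectors through float16 and compares pairwise dot products. It uses random unit vectors as a stand-in for real embeddings, so it illustrates the idea without reproducing the 0.02% figure; `random_unit_vectors` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vectors(n, dim=3072):
    """Stand-in for real embeddings: random float32 unit vectors."""
    v = rng.standard_normal((n, dim)).astype(np.float32)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

full = random_unit_vectors(4)
stored = full.astype(np.float16)       # compact form, as it might be written to storage
restored = stored.astype(np.float32)   # converted back before scoring

# Compare pairwise dot products before and after the float16 round trip.
max_abs_error = np.max(np.abs(full @ full.T - restored @ restored.T))

print(f"bytes per vector: {full[0].nbytes} -> {stored[0].nbytes}")
print(f"max dot-product error: {max_abs_error:.2e}")
```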
Gram Matrix
Given $k$ course embedding vectors $v_1, \dots, v_k$, stacked as the rows of a matrix $V$, we form the Gram matrix:
$$G = \frac{1}{k} V V^\top + \epsilon I$$
where $\epsilon$ is a ridge regularization term ensuring positive definiteness. The division by $k$ normalizes the scale so that volumes remain comparable across different transcript sizes.
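The construction can be sketched in a few lines of NumPy. The function name `gram_matrix` and the ridge value `eps=1e-6` are assumptions for illustration; the value used in production is not stated here.

```python
import numpy as np

def gram_matrix(vectors, eps=1e-6):
    """Return G = (1/k) V V^T + eps * I for a (k, d) array of unit vectors.

    eps is an assumed ridge value, chosen only for this example.
    """
    V = np.asarray(vectors, dtype=np.float64)
    k = V.shape[0]
    return (V @ V.T) / k + eps * np.eye(k)

# Three mutually orthogonal unit vectors: off-diagonal entries are zero.
G_orthogonal = gram_matrix(np.eye(3))

# Three identical unit vectors: without the ridge this matrix would be singular.
G_identical = gram_matrix(np.ones((3, 4)) / 2.0)

print(G_orthogonal)
print(np.linalg.eigvalsh(G_identical))
```

The second case shows why the ridge matters: identical courses give a rank-one matrix, and the added term keeps every eigenvalue strictly positive.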
Volume via Log-Determinant
The diversity score is based on the log-determinant of the Gram matrix, which measures the volume of the parallelepiped spanned by the course vectors:
$$\mathrm{vol} = \log \det G$$
We compute this efficiently via Cholesky decomposition $G = L L^\top$, where $L$ is lower triangular:
$$\log \det G = 2 \sum_{i=1}^{k} \log L_{ii}$$
A larger determinant means the vectors point in more diverse directions — the courses cover a wider range of topics.
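A minimal sketch of the Cholesky route, checked against NumPy's `slogdet`. The two 2×2 matrices are hypothetical Gram matrices for a pair of courses (k = 2, so the diagonal is 1/2), and `log_volume` is an illustrative name.

```python
import numpy as np

def log_volume(G):
    """log det(G), computed as 2 * sum(log(diag(L))) where G = L L^T."""
    L = np.linalg.cholesky(G)  # raises LinAlgError if G is not positive definite
    return 2.0 * np.sum(np.log(np.diag(L)))

# Hypothetical two-course Gram matrices.
similar = np.array([[0.50, 0.45], [0.45, 0.50]])    # cosine similarity 0.9
different = np.array([[0.50, 0.05], [0.05, 0.50]])  # cosine similarity 0.1

print(log_volume(similar), log_volume(different))
print(np.linalg.slogdet(similar)[1])  # reference value for the first matrix
```

The pair with the lower similarity has the larger log-determinant, matching the intuition that dissimilar courses span more volume.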
Score Normalization
Raw volumes are normalized to a 0–100 scale using precomputed reference curves:
Max curve: computed by a greedy algorithm that selects up to 200 courses maximizing the marginal log-determinant increase at each step, using the Schur complement for efficient incremental updates.
Min curve: for each course count $k$, we sample random subsets of $k$ courses from each department and use the average volume of the lowest-scoring department. Near-duplicate courses (cosine similarity > 0.90) are removed first to prevent deflated volumes from trivially similar offerings.
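The max curve and the final rescaling can be sketched as follows. This is an illustration under stated assumptions, not the production code:

- The greedy step works on the unnormalized kernel with a fixed ridge `delta`. Dividing by k shifts the log-determinant by roughly the same amount for every candidate at a given size, so it should not change which course is picked.
- `normalize_score` assumes linear interpolation between the two curves with clamping to 0–100. That is consistent with a score of 50 meaning "halfway", but the exact formula is not given above.
- The function names and the random test data are hypothetical.

```python
import numpy as np

def greedy_max_curve(X, n_select, delta=1e-6):
    """Greedy selection; returns chosen indices and the cumulative log-det curve."""
    n = X.shape[0]
    K = X @ X.T + delta * np.eye(n)   # unnormalized kernel plus an assumed ridge
    d = np.diag(K).copy()             # Schur complement of each candidate
    C = np.zeros((n, n_select))       # incremental Cholesky rows, one per course
    available = np.ones(n, dtype=bool)
    chosen, curve, total = [], [], 0.0
    for t in range(n_select):
        # Pick the unused course with the largest Schur complement.
        j = int(np.argmax(np.where(available, d, -np.inf)))
        total += np.log(d[j])         # marginal log-det gain of adding course j
        e = (K[:, j] - C[:, :t] @ C[j, :t]) / np.sqrt(d[j])
        C[:, t] = e
        d = d - e ** 2                # update every candidate's Schur complement
        available[j] = False
        chosen.append(j)
        curve.append(total)
    return chosen, curve

def normalize_score(volume, min_volume, max_volume):
    """Assumed linear 0-100 rescaling between the min and max curves, clamped."""
    fraction = (volume - min_volume) / (max_volume - min_volume)
    return float(np.clip(100.0 * fraction, 0.0, 100.0))

# Hypothetical data: 30 random unit vectors in 8 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)

chosen, curve = greedy_max_curve(X, n_select=5)
print(chosen, curve[-1])
print(normalize_score(-4.0, min_volume=-6.0, max_volume=-2.0))  # → 50.0
```

Each step costs one matrix-vector product rather than a fresh determinant, which is what makes selecting up to 200 courses from a large catalog practical.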
Cosine Similarity & Clusters
Pairwise course similarity is measured by cosine similarity:
$$\mathrm{sim}(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$$
Surprising pairs are the 5 course pairs with the lowest similarity (highest cosine distance). Clusters are connected components in a graph where an edge exists between courses with similarity ≥ 0.85.
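Both definitions can be sketched directly from a similarity matrix. The matrix below is made up for illustration, and the function names are hypothetical. Whether single-course components are displayed as clusters is not specified above, so this sketch simply returns them.

```python
import numpy as np
from itertools import combinations

def surprising_pairs(S, n_pairs=5):
    """Return the n_pairs index pairs with the lowest similarity."""
    pairs = combinations(range(S.shape[0]), 2)
    return sorted(pairs, key=lambda p: S[p])[:n_pairs]

def clusters(S, threshold=0.85):
    """Connected components of the graph with an edge where similarity >= threshold."""
    n = S.shape[0]
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # depth-first search from this course
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j not in seen and S[i, j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        components.append(sorted(component))
    return components

# Hypothetical cosine similarity matrix for four courses.
S = np.array([
    [1.00, 0.90, 0.30, 0.20],
    [0.90, 1.00, 0.86, 0.25],
    [0.30, 0.86, 1.00, 0.10],
    [0.20, 0.25, 0.10, 1.00],
])

print(surprising_pairs(S, n_pairs=2))  # → [(2, 3), (0, 3)]
print(clusters(S))                     # → [[0, 1, 2], [3]]
```

Courses 0 and 2 land in the same cluster even though their direct similarity is only 0.30, because connected components link them through course 1.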
Privacy
Only course codes are sent to the server for scoring. If you upload a PDF transcript, it is parsed entirely in your browser using PDF.js — no transcript data is uploaded to our server. Grades and personal information are never transmitted.