How It Works

A look inside the scoring engine, from the intuition to the math.

The Big Picture

Your transcript score measures the intellectual breadth of your course selection. It answers the question: how much of the academic landscape have you explored?

We start by converting every course at Illinois into a mathematical representation based on its description. Courses with similar content end up close together; courses covering unrelated topics end up far apart. Think of each course as a point in a vast high-dimensional conceptual space.

Your score reflects the volume that your courses span in this space. A student who takes only CS courses occupies a small corner — those courses cluster tightly together. A student who mixes computer science, music, psychology, and chemistry spans a much larger region.

To make the score intuitive, we normalize it to a 0–100 scale. We compare your volume against two reference points:

  • The maximum — the most diverse transcript possible at your course count, found by a greedy algorithm that picks the most different course at each step.
  • The minimum — a realistic floor representing a student who stays entirely within one department.

A score of 50 means you're halfway between a single-department specialist and the theoretical maximum. The Surprising Pairs section highlights the course pairs that are least related to each other, while Clusters shows groups of courses that are similar to each other.

Try It

The widget below uses real course embeddings. Toggle courses on and off to see how the Gram matrix and volume change. Lower off-diagonal values mean courses are more different, which increases the volume.

CS 124: Intro to Computer Science I

A first course in computer science for majors. Fundamentals of computing, problem solving, and the Java programming language. Introduction to abstraction, object-oriented programming, data structures, and algorithms.

MATH 241: Calculus III

Third course in calculus and analytic geometry, including vector analysis. Topics include vectors in two and three dimensions, partial derivatives, multiple integrals, line integrals, surface integrals, and Stokes' theorem.

MUS 130: Introduction to Music

A survey of Western art music from the Middle Ages to the present. Students develop active listening skills and study the structural, historical, and cultural contexts of representative works by major composers.

The embedding model reads the full description and captures semantic meaning, not just word overlap. Courses that share boilerplate (“introduction to X”) but cover different subjects still end up far apart. The tight clusters you see come from genuine topical overlap.

Gram matrix G — each cell is the dot product between two course vectors, divided by k=3

        CS      MATH    MUS
CS      0.333   0.145   0.138
MATH    0.145   0.333   0.106
MUS     0.138   0.106   0.333

log det(G) = -3.724 (3 courses)

The amber diagonal is each course’s self-similarity — cosine similarity is 1, divided by k, so every diagonal equals 1/k. The off-diagonal cells are what matter for diversity: their colors reflect the raw cosine similarity between course pairs (k-independent), while the displayed numbers are the Gram entries (divided by k), which is why the numbers change as you add or remove courses but the colors stay fixed.
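As a worked example of that dependence on k, take the CS/MATH pair. The widget does not display the raw cosine similarity, so the value 0.435 below is an approximation recovered from the displayed entry (0.145 × 3) and should be read as illustrative. The tiny ridge term described under Technical Details is ignored.

```latex
G_{\text{CS},\text{MATH}} = \frac{\cos(\text{CS},\text{MATH})}{k}
\approx \frac{0.435}{3} = 0.145 \quad (k = 3),
\qquad
\frac{0.435}{2} = 0.2175 \quad (k = 2)
```

With MUS toggled off, the diagonal likewise moves from 1/3 ≈ 0.333 to 1/2 = 0.5, while the cell color, which tracks the cosine similarity itself, does not change.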

Technical Details

Course Embeddings

Each course is embedded using OpenAI's text-embedding-3-large model, which maps the course title and description to a 3,072-dimensional unit vector. Embeddings are generated per-semester to capture evolving course content, with a fallback to the most recent embedding when a specific semester isn't available. Vectors are stored as float16 to halve database size with negligible accuracy impact (less than 0.02%).
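The storage trade-off can be illustrated with a minimal sketch. It assumes float32 (4 bytes per value) as the baseline being halved, uses a random unit vector rather than a real embedding, and the helper names `pack_float16` and `unpack_float16` are hypothetical. It only measures the round-trip error; it does not reproduce the accuracy figure quoted above.

```python
import math
import random
import struct

def pack_float16(values):
    """Serialize floats as little-endian IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)

def unpack_float16(data):
    """Inverse of pack_float16: decode 2-byte halves back into Python floats."""
    return list(struct.unpack(f"<{len(data) // 2}e", data))

# A random 3,072-dimensional unit vector standing in for a real embedding.
rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0) for _ in range(3072)]
norm = math.sqrt(sum(x * x for x in raw))
unit = [x / norm for x in raw]

stored = pack_float16(unit)        # 2 bytes per value instead of 4 for float32
restored = unpack_float16(stored)
max_abs_error = max(abs(a - b) for a, b in zip(unit, restored))
```

Because the components of a high-dimensional unit vector are small, the absolute rounding error per component stays far below the typical differences between course vectors.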

Gram Matrix

Given $k$ course embedding vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k \in \mathbb{R}^d$, we form the $k \times k$ Gram matrix:

$$G_{ij} = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{k} + \lambda \, \delta_{ij}$$

where $\lambda = 10^{-6}$ is a ridge regularization term ensuring positive definiteness. The division by $k$ normalizes the scale so that volumes remain comparable across different transcript sizes.
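The definition above can be sketched in a few lines of plain Python. The function name `gram_matrix` and the three-dimensional toy vectors are illustrative assumptions; real embeddings have 3,072 dimensions.

```python
def gram_matrix(vectors, ridge=1e-6):
    """Build G[i][j] = (v_i . v_j) / k, plus `ridge` on the diagonal, for k vectors."""
    k = len(vectors)
    gram = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            value = dot / k + (ridge if i == j else 0.0)
            gram[i][j] = gram[j][i] = value  # the matrix is symmetric
    return gram

# Three toy unit vectors (illustrative, not real course embeddings).
toy = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.6, 0.8],
]
G = gram_matrix(toy)
```

Each diagonal entry comes out as 1/3 plus the ridge term, and each off-diagonal entry is the pair's dot product divided by 3.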

Volume via Log-Determinant

The diversity score is based on the log-determinant of the Gram matrix. Ignoring the small ridge term, $\det(G)$ equals the squared volume of the parallelepiped spanned by the scaled course vectors $\mathbf{v}_i / \sqrt{k}$, so its logarithm rises and falls with the volume the courses span:

$$\text{volume} = \log \det(G)$$

We compute this efficiently via Cholesky decomposition $G = LL^\top$, where $L$ is lower triangular:

$$\log \det(G) = 2 \sum_{i=1}^{k} \log L_{ii}$$

A larger determinant means the vectors point in more diverse directions — the courses cover a wider range of topics.
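The Cholesky route to the log-determinant can be sketched without any libraries. The function name `cholesky_log_det` is hypothetical, and the input is the Gram matrix exactly as the widget displays it, rounded to three decimals, so the result lands close to the widget's -3.724 rather than matching it digit for digit.

```python
import math

def cholesky_log_det(matrix):
    """Return log det(matrix) for a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][p] * lower[j][p] for p in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    # det(G) is the squared product of the diagonal of L, hence the factor 2.
    return 2.0 * sum(math.log(lower[i][i]) for i in range(n))

# Gram entries as displayed by the widget for CS 124, MATH 241, MUS 130.
widget_gram = [
    [0.333, 0.145, 0.138],
    [0.145, 0.333, 0.106],
    [0.138, 0.106, 0.333],
]
log_det = cholesky_log_det(widget_gram)
```

The positive-pivot check doubles as a test for positive definiteness, which is one reason the ridge term is useful in practice.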

Score Normalization

Raw volumes are normalized to a 0–100 scale using precomputed reference curves:

$$\text{score} = \frac{\text{volume} - \text{min}[k]}{\text{max}[k] - \text{min}[k]} \times 100$$
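As a minimal sketch of this formula: the reference values below are made up for illustration, since the real curves are precomputed per course count. The source does not say whether scores outside the two reference points are clamped, so this version leaves them unclamped.

```python
def normalized_score(volume, min_volume, max_volume):
    """Map a raw log-det volume onto the 0 to 100 scale between two reference values."""
    if max_volume <= min_volume:
        raise ValueError("max_volume must exceed min_volume")
    return (volume - min_volume) / (max_volume - min_volume) * 100.0

# Hypothetical reference values for one course count k.
example = normalized_score(volume=-5.0, min_volume=-8.0, max_volume=-2.0)
```

Here the volume sits exactly midway between the two reference points, so `example` is 50.0.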

Max curve: computed by a greedy algorithm that selects up to 200 courses maximizing the marginal log-determinant increase at each step, using the Schur complement for efficient incremental updates.
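One way such a greedy selection could look is sketched below. It is an assumption-laden illustration, not the production algorithm: it works on the unscaled matrix of dot products plus the ridge, tracks each candidate's Schur complement (the factor by which adding it would multiply the determinant), and updates those values incrementally instead of refactoring the matrix. The names `greedy_max_curve`, `residual`, and `rows` are hypothetical. Converting a curve value to the scaled Gram matrix defined earlier subtracts k log k, up to the tiny ridge term.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def greedy_max_curve(vectors, steps, ridge=1e-6):
    """Greedily pick vectors, maximizing the log-det gain at each step.

    Returns (selected indices, cumulative log det after each pick).
    """
    n = len(vectors)
    # residual[c]: Schur complement of candidate c given the current selection.
    residual = [dot(v, v) + ridge for v in vectors]
    # rows[c]: Cholesky-row coefficients of c against the selected vectors.
    rows = [[] for _ in range(n)]
    remaining = list(range(n))
    selected, curve, total = [], [], 0.0
    for _ in range(min(steps, n)):
        best = max(remaining, key=lambda c: residual[c])
        if residual[best] <= 0.0:
            break  # nothing left that adds volume
        remaining.remove(best)
        total += math.log(residual[best])  # marginal log-det increase
        selected.append(best)
        curve.append(total)
        pivot = math.sqrt(residual[best])
        for c in remaining:
            coeff = (dot(vectors[best], vectors[c]) - dot(rows[best], rows[c])) / pivot
            rows[c].append(coeff)
            residual[c] -= coeff * coeff
    return selected, curve

# Toy input: index 1 duplicates index 0, so the greedy pass skips it.
toy = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
picked, curve = greedy_max_curve(toy, steps=3)
```

After index 0 is chosen, the duplicate's residual collapses to roughly the ridge value, which is why it is passed over in favor of the two orthogonal vectors.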

Min curve: for each $k$, we sample random subsets of $k$ courses from each department and use the average volume of the lowest-scoring department. Near-duplicate courses (cosine similarity > 0.90) are removed first to prevent deflated volumes from trivially similar offerings.
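The sampling step might be sketched as follows. The number of samples per department is not stated above, so `samples=100` is an assumption, as are the function name `min_curve_value` and the `volume_fn` callback, which stands in for the log-determinant computation. Near-duplicate removal is assumed to have happened already.

```python
import random

def min_curve_value(departments, k, volume_fn, samples=100, seed=0):
    """Average the volume of random k-course subsets per department; return the lowest average."""
    rng = random.Random(seed)
    averages = []
    for courses in departments.values():
        if len(courses) < k:
            continue  # department too small to supply k courses
        total = sum(volume_fn(rng.sample(courses, k)) for _ in range(samples))
        averages.append(total / samples)
    if not averages:
        raise ValueError("no department has at least k courses")
    return min(averages)

# Toy data: numbers stand in for courses and `sum` stands in for the volume function.
toy_departments = {"A": [1.0, 1.0, 1.0], "B": [5.0, 5.0, 5.0]}
floor = min_curve_value(toy_departments, k=2, volume_fn=sum, samples=10)
```

Passing the volume function as a parameter keeps the sampling logic independent of how volumes are computed.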

Cosine Similarity & Clusters

Pairwise course similarity is measured by cosine similarity:

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$

Surprising pairs are the 5 course pairs with the lowest similarity (highest cosine distance). Clusters are connected components in a graph where an edge exists between courses with similarity ≥ 0.85.
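Both definitions can be sketched with the standard library. The function names are hypothetical, the toy vectors are not real embeddings, and this version reports every connected component, including single-course ones; whether the site hides singletons is not stated above.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def surprising_pairs(vectors, count=5):
    """Return the `count` index pairs with the lowest cosine similarity."""
    scored = [(cosine(vectors[i], vectors[j]), i, j)
              for i, j in combinations(range(len(vectors)), 2)]
    return [(i, j) for _, i, j in sorted(scored)[:count]]

def clusters(vectors, threshold=0.85):
    """Connected components of the graph with an edge wherever similarity >= threshold."""
    parent = list(range(len(vectors)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two components
    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy vectors: 0 and 1 are close, 2 and 3 are close, the two groups are nearly orthogonal.
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
least_related = surprising_pairs(toy, count=1)
groups = clusters(toy)
```

Because clusters are connected components, two courses can share a cluster without being directly similar, as long as a chain of similar courses links them.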

Privacy

Only course codes are sent to the server for scoring. If you upload a PDF transcript, it is parsed entirely in your browser using PDF.js, and the file itself is never uploaded to our server. Grades and personal information are never transmitted.