# Modeling Approach

How we do it

# Design principles

We build our models with four principles in mind:

- INCLUSIVITY - our models should promote access and recognize talent in all places
- INTERPRETABILITY - you should know exactly what our models are doing
- ACCURACY - our models do a good job at predicting specific and general outcomes
- CREATIVITY - we embrace new approaches and novel connections

We are aware that, in some instances, the above principles come into conflict. In these cases, we are proactive and transparent with our users about the decisions we make and our rationale.

# The importance of Skills AND Outcomes

Mapping talent on the basis of skills and outcomes leads to accuracy and inclusivity. That’s why AdeptID balances both techniques in its approach.

## Skills-based models

We look beyond the job title to see the underlying skills a person has developed. This allows us to map the distance between any two jobs based on those underlying skills.
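As a rough illustration of "distance between two jobs", the sketch below compares occupations by the cosine distance between their skill vectors. The occupation names, the three skill dimensions, and the scores are hypothetical, and cosine distance is an assumed metric chosen for illustration; the source does not state which distance measure is used.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length skill vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("skill vectors must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical skill scores, in the order:
# [customer service, data entry, welding]
occupations = {
    "retail_salesperson": [0.9, 0.4, 0.0],
    "bank_teller": [0.8, 0.7, 0.0],
    "welder": [0.1, 0.0, 0.9],
}

d_teller = cosine_distance(occupations["retail_salesperson"], occupations["bank_teller"])
d_welder = cosine_distance(occupations["retail_salesperson"], occupations["welder"])

# With these toy numbers the teller role sits much closer to retail than welding does.
print(f"retail -> teller: {d_teller:.3f}")
print(f"retail -> welder: {d_welder:.3f}")
```

Two job titles that share no words can still come out close under this kind of measure if their underlying skill profiles overlap.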

→ This approach is not yet widely adopted by the market, but it is not particularly novel either. It is straightforward, if a bit simplistic.

## Outcomes-based models

We take into account real employment data to surface which skills are associated with successful results.

Using real outcomes makes our models more empirical and frees us from having to adopt a single, rigid taxonomy.

→ This approach is novel and requires a lot of data. If it is unmonitored and lacks the proper safeguards, it can be highly susceptible to bias.

## How we define “Skills”

We have an intentionally open-ended definition of “skills”. For us, skills are descriptors of the capacity of individuals that could serve as potential predictors of employment and/or educational outcomes.

To our models, skills are numeric (scalar), predictive features used to describe individuals and occupations. Our sources provide us with >10,000 scalar attributes describing >1,000 distinct occupations. Rather than focusing on a single data source for skills, we use collaborative filtering and singular value decomposition (SVD) to combine skill attributes across a variety of taxonomies into a set of latent skill variables (see Modeling Techniques for further detail).

## How we define “Outcomes”

Our approach to modeling allows us to train models based on different types of outcomes. The conventional definition of a “successful” outcome is a person getting hired for a job in a new occupation. Our models look at hundreds of thousands of past attempted transitions - people trying and either succeeding or failing to get a job in a new occupation - in order to both predict the success of similar transitions and to understand the contributing factors to that success (which allows us to identify skills gaps and training opportunities).

We recognize that “success” shouldn’t just be defined by getting a job, but by thriving in it (e.g. getting promoted, experiencing steady wage gains, or maintaining good health). In private contexts, we have been able to train models to predict and understand these “higher order” definitions of success. However, the core models available through the Mobility API are trained on the more straightforward measure described above: getting hired for a job in a new occupation.

## Data sources

### Skills

- O*NET
- Lightcast

### Outcomes

- Hiring decisions from employers
- Training provider placement “outcomes”

### Macroeconomic / Other contextual

- Occupational Employment and Wage Statistics (OEWS) from the Bureau of Labor Statistics (BLS)
- Integrated Postsecondary Education Data System (IPEDS)
- Current Population Survey (CPS) from the Census Bureau
- National (US) Jobs Postings Feeds (EMSI)

# Modeling Techniques

Our modeling approach uses logistic regression applied to a binary classification problem. We train models on individual-level, historic hiring data from applicant tracking systems (ATS) and outcome data from our API partners. These models predict the likelihood of hiring in a specific occupation given a specific prior work history.

To avoid overfitting to specific sectors of the economy, we currently augment the real outcomes data we have collected from our partners with synthetic data generated using the underlying skill distances between occupations. As we increase the breadth of real outcomes data, we will refresh and/or replace the synthetic data used in our models.

To train our models, we construct representation spaces to describe the entities associated with the hiring prediction: talent, employers, training providers (see below). Our initial public-facing models use only the talent representation space described below. Private-context (and future public) models will incorporate additional features for employers and training providers. As we achieve a critical mass of real outcomes data with employer and training provider representation spaces, we will deploy models that include these associated features.

Our current models use ~3,000 predictive features based on occupational skills. The first group of features describes the skills associated with an individual’s prior work history. They are constructed using the average skill values across the individual’s prior jobs. The next set of features describes the skills associated with the target occupation of the transition. A third set of features describes the distances between prior work history and target occupations on the basis of constituent skills. The features in each of these categories represent the variation between occupations across all >10,000 raw skill attributes in our dataset. We achieve this compressed representation using singular value decomposition (SVD) on the matrix of occupations and skills described below (see Talent representation space).
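The three feature groups can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_features`, the occupation codes, and the two-dimensional latent vectors are hypothetical, and the distance group is assumed here to be the per-feature absolute gap between history and target, which the source does not specify.

```python
import numpy as np


def build_features(prior_jobs, target, latent):
    """Concatenate history, target, and distance feature groups.

    latent maps an occupation code to its 1-D array of latent skill values.
    """
    # Group 1: average latent skill values across the prior jobs.
    history = np.mean([latent[code] for code in prior_jobs], axis=0)
    # Group 2: latent skill values of the target occupation.
    target_vec = latent[target]
    # Group 3 (assumed form): per-feature absolute gap between history and target.
    distance = np.abs(history - target_vec)
    return np.concatenate([history, target_vec, distance])


# Hypothetical latent skill vectors for three occupations.
latent = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.0, 1.0]),
    "C": np.array([1.0, 1.0]),
}

# A person who has held jobs A and B, considering a move into C.
x = build_features(["A", "B"], "C", latent)
print(x.shape)
```

With k latent skill features per occupation, this layout yields 3k model inputs, one block per feature group.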

With this model formulation, we train our logistic regression model with regularization. Once trained, the model allows us both to make predictions for new individuals and to determine which skills and/or attributes are most important to successfully transitioning into new employment (by investigating the learned model weights).
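The sketch below shows the general shape of this step: fit a regularized logistic regression, score a new example, and rank features by the size of their learned weights. It is written in plain NumPy with gradient descent and an L2 penalty, both of which are assumptions made for illustration (the source does not name the penalty or the solver), and it runs on randomly generated toy data rather than real hiring outcomes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_l2(X, y, l2=0.01, lr=0.5, steps=2000):
    """Fit logistic regression by gradient descent with an L2 penalty on the weights."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / n + l2 * w  # log-loss gradient plus L2 term
        grad_b = np.mean(p - y)              # the intercept is not penalized
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data: feature 0 raises the odds of a hire, feature 1 lowers them,
# and feature 2 has no effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_logits = 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(500) < sigmoid(true_logits)).astype(float)

w, b = fit_logistic_l2(X, y)

# Predicted probability for a new, unseen example.
x_new = np.array([1.0, 0.0, 0.0])
p_new = sigmoid(x_new @ w + b)

# Rank features by absolute weight to see which ones drive the prediction.
ranked = np.argsort(-np.abs(w))
print("weights:", np.round(w, 2), "ranking:", ranked)
```

Because the model is linear in its inputs, the sign and magnitude of each weight can be read directly, which is what makes this kind of inspection possible.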

# Representation spaces

We see this as a three-way matching problem between talent, demand, and training. We therefore need a way to represent each of these entities in our ML models.

A typical technique for representing categorical variables (like occupations, employers, and training providers) is to create 1-hot encodings of these variables. This technique fails to capture the degree of similarity between specific occupations, specific employers and specific training providers. By creating representation spaces for each of these entities, we allow our models to see the relative similarities and differences between entities of the same class. This allows the models to both learn what features of these entities are predictive of success and to make predictions about previously unseen occupations, employers and training providers on the basis of their similarity to others in the training data.
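To make the contrast concrete, the sketch below compares one-hot vectors with dense representation vectors for three occupations. The occupation names and the dense values are hypothetical and chosen only to illustrate the point.

```python
import numpy as np


def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


names = ["nurse", "medical_assistant", "truck_driver"]

# One-hot encoding: each occupation gets its own axis.
one_hot = np.eye(len(names))

# Hypothetical dense representation vectors for the same occupations.
dense = np.array([
    [0.9, 0.8, 0.1],  # nurse
    [0.8, 0.7, 0.2],  # medical_assistant
    [0.1, 0.2, 0.9],  # truck_driver
])

# Any two distinct one-hot vectors are orthogonal, so similarity is always 0.
one_hot_sim = cosine_similarity(one_hot[0], one_hot[1])

# Dense vectors can express that some pairs are closer than others.
dense_close = cosine_similarity(dense[0], dense[1])
dense_far = cosine_similarity(dense[0], dense[2])

print(one_hot_sim, round(dense_close, 3), round(dense_far, 3))
```

A new occupation that never appeared in training has no usable one-hot column, but it can still be placed in the dense space and compared with its neighbors.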

Below we describe the specifics of each entity’s representation space:

## Talent

We represent individuals using the skills associated with the occupations that make up their prior work history. We start with a skill-based description for each of >1,000 occupations in the U.S. economy (as defined by the Bureau of Labor Statistics). For each occupation, we have >10,000 distinct skills describing it (aggregated across a variety of skill taxonomies). This underlying data creates a roughly 1,000 by 10,000 sparse matrix of occupations and skills. (The sparsity of this matrix arises from the fact that not all occupations are completely represented in all taxonomies.) We use Collaborative Filtering (CF) to convert our sparse matrix of occupations and skills into a dense one. We then use Singular Value Decomposition (SVD) to reduce the dimensionality of our representation space. We compress our >10,000 skill features into a set of 1,000 latent skill features while preserving >99% of the variance in skills across occupations. These 1,000 latent skill features are the model’s input features described above in Modeling Techniques.
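The densify-then-compress pipeline can be sketched on a toy matrix. Two caveats: the fill step below uses simple column-mean imputation as a stand-in, not collaborative filtering, and the matrix sizes, noise level, and function names are hypothetical. The SVD step keeps the smallest number of components whose cumulative explained variance reaches the chosen threshold.

```python
import numpy as np


def impute_column_means(M):
    """Fill missing entries with each skill's mean (a simple stand-in for CF)."""
    filled = M.copy()
    col_means = np.nanmean(filled, axis=0)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


def compress(M, variance_to_keep=0.99):
    """Project occupations onto the top-k singular directions of the centered matrix."""
    centered = M - M.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest k whose cumulative explained variance reaches the threshold.
    k = int(np.searchsorted(explained, variance_to_keep) + 1)
    return U[:, :k] * s[:k], k


# Toy occupations-by-skills matrix: 50 occupations, 200 skills,
# low-rank structure plus a little noise, with about 10% of entries missing.
rng = np.random.default_rng(0)
skills = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 200))
skills += 0.01 * rng.normal(size=skills.shape)
skills[rng.random(skills.shape) < 0.1] = np.nan

dense = impute_column_means(skills)
latent, k = compress(dense)
print(f"kept {k} latent skill features out of {skills.shape[1]} raw skills")
```

Each row of `latent` is one occupation expressed in the compressed space; rows like these are what a downstream model would consume in place of the raw skill columns.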

## Employers

We construct a representation space for employers using historic demand as measured by job postings. We convert historic postings data into summaries of demand across geographies (CBSAs), occupations, and time. Using similar dimensionality reduction techniques to those in the talent representation space, we then compress the employer space into a smaller set of latent features which preserve the majority of variance between employers in the set. Beyond demand, we will incorporate other characteristics of employers over time (e.g. workforce practices, business performance, granular workforce composition). We will expose these representation spaces in future iterations of our publicly available models.

## Training Providers

Our initial representation space for training providers focuses on data available in IPEDS (Integrated Postsecondary Education Data System). This dataset allows us to summarize the programs of study, the student population in each program (including demographics), and associated measures of aggregate student success (e.g. graduation rates). As in other representation spaces, we construct latent variables from these data points that preserve the vast majority of explained variance across training providers.

As with the employer representation spaces, we will expose these training provider features in future iterations of our publicly available models as we become more confident that the volume of historical training data is adequate to add value.

We recognize that the universe of training providers, particularly for vocational training programs, is not adequately captured by the IPEDS dataset, and we are working with partners to use a more complete dataset.

# Algorithmic Bias

Any system that uses historic outcomes to build predictive models is subject to the bias inherent in those historic examples. This fact requires AdeptID (and all developers of such models) to develop tools and processes to prevent algorithmic bias from influencing the predictions made by its algorithms. At AdeptID our approach contains several pillars:

- Partnering with mission-aligned organizations (Year Up, Grads of Life, and others)
- Continuous monitoring of our algorithms for bias
- Implementation of technical solutions to counteract biases present in training data

We are committed to evaluating and integrating all available methods for bias removal into our predictive models. This will always be an area of active research at AdeptID.

## Measuring Disparate Impact

Disparate Impact refers to practices in employment, housing, and other areas that adversely affect one group of people with a protected characteristic more than another, even though the rules applied by employers or landlords are formally neutral. Wherever possible (i.e. in cases where we’ve received demographic data) we monitor our models for disparate impact.
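One common way to quantify disparate impact is to compare selection rates across groups and check the ratio of the lowest to the highest against the "four-fifths" (0.8) heuristic. The sketch below illustrates that calculation only; it is not a description of AdeptID's monitoring tooling, and the group labels, predictions, and function names are hypothetical.

```python
from collections import defaultdict


def selection_rates(groups, predictions):
    """Return the share of positive predictions within each group."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, prediction in zip(groups, predictions):
        totals[group] += 1
        selected[group] += int(prediction)
    return {group: selected[group] / totals[group] for group in totals}


def disparate_impact_ratio(rates):
    """Ratio of the lowest group selection rate to the highest."""
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group has any positive predictions")
    return min(rates.values()) / highest


# Hypothetical data: 10 candidates in each of two groups,
# with 1 meaning the model recommends the candidate.
groups = ["a"] * 10 + ["b"] * 10
predictions = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7

rates = selection_rates(groups, predictions)
ratio = disparate_impact_ratio(rates)

# Ratios below 0.8 are commonly treated as a signal to investigate further.
flagged = ratio < 0.8
print(rates, round(ratio, 2), flagged)
```

A ratio like this is a screening signal rather than a verdict; it depends on having demographic labels and on enough examples per group for the rates to be stable.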

We are actively testing a suite of tools to mitigate disparate impact both internally and with our academic research partners.
