11 Types of Variables in a Dataset
Beyond features, targets, etc.

TODAY'S ISSUE
In any tabular dataset, we typically categorize the columns as either features or targets.
However, there are many more types of variables one can find (or define) in a dataset, and I want to discuss them today.
These are depicted in the animation below:

Let’s begin!
First up are independent and dependent variables, the most common and fundamental variable types in ML.
Independent variables are the features that are used as input to predict the outcome. They are also referred to as predictors/features/explanatory variables.

The dependent variable is the outcome that is being predicted. It is also called the target, response, or output variable.
Confounding variables are typically found in cause-and-effect studies (causal inference).

These variables are not of primary interest in the cause-and-effect equation but can potentially lead to spurious associations.
To exemplify, say we want to measure the effect of ice cream sales on the sales of air conditioners.

As you may have guessed, these two measurements are highly correlated.
However, there’s a confounding variable — temperature, which influences both ice cream sales and the sales of air conditioners.

To study the true causal impact, it is essential to account for the confounder (temperature). Otherwise, the study will produce misleading results.
In fact, it is due to the confounding variables that we hear the statement: “Correlation does not imply causation.”
In the above example, air conditioner and ice cream sales are also correlated variables.
More formally, a change in one variable is associated with a change in another.
In the above example, to measure the true effect of ice cream sales on air conditioner sales, we must ensure that the temperature remains unchanged throughout the study.

Once controlled, temperature becomes a control variable.
More formally, these are variables that are not the primary focus of the study but are crucial to account for to ensure that the effect we intend to measure is not biased or confounded by other factors.
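To make this concrete, here is a minimal simulation sketch (with made-up numbers): temperature drives both sales series, so they appear strongly correlated, but once we control for temperature, the association largely disappears.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Hypothetical data: temperature (the confounder) drives both sales series.
temperature = rng.normal(25, 5, size=1000)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 3, size=1000)
ac_sales = 1.5 * temperature + rng.normal(0, 3, size=1000)

# The raw correlation looks strong even though neither variable causes the other.
print(np.corrcoef(ice_cream_sales, ac_sales)[0, 1])   # high (~0.9)

# Control for temperature: regress both series on it and correlate the residuals.
t = temperature.reshape(-1, 1)
res_ice = ice_cream_sales - LinearRegression().fit(t, ice_cream_sales).predict(t)
res_ac = ac_sales - LinearRegression().fit(t, ac_sales).predict(t)
print(np.corrcoef(res_ice, res_ac)[0, 1])             # near zero
```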
A latent variable is one that is not directly observed but is instead inferred from other observed variables.
For instance, we use clustering algorithms because the true labels do not exist, and we want to infer them somehow.

The true label is a latent variable in this case.
Another common example of a latent variable is “intelligence.”
Intelligence itself cannot be directly measured; it is a latent variable.
However, we can infer intelligence through various observable indicators such as test scores, problem-solving abilities, and memory retention.
We also encountered latent variables when we studied Gaussian mixture models, if you remember.
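Here's a small sketch of that idea using scikit-learn's GaussianMixture, assuming synthetic two-cluster data: the true cluster labels are never observed, and the model infers them.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two hypothetical clusters; the true labels are never observed (latent).
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(200, 2)),
])

# The Gaussian mixture infers the latent cluster assignment for each sample.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
inferred_labels = gmm.predict(X)          # hard assignments
responsibilities = gmm.predict_proba(X)   # soft (probabilistic) assignments
```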
As the name suggests, interaction variables represent the interaction effect between two or more variables and are often used in regression analysis.
Here’s an instance where I remember using them.
In a project, I studied the impact of population density and income levels on spending behavior.

For the regression analysis, I created interaction variables by cross-multiplying the one-hot encoded columns of the two variables.
This produced 9 interaction variables:
Conducting the regression analysis on interaction variables revealed more useful insights than what I observed without them.
To summarize, the core idea is to study two or more variables together rather than independently.
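For illustration, here is a minimal sketch of that cross-multiplication, assuming (hypothetically) that both variables were binned into three categories each, which is what yields 9 interaction columns:

```python
import pandas as pd

# Hypothetical data: both variables are binned into three categories each.
df = pd.DataFrame({
    "density": ["low", "mid", "high", "mid"],
    "income":  ["low", "high", "mid", "mid"],
})

density = pd.get_dummies(df["density"], prefix="density").astype(int)
income = pd.get_dummies(df["income"], prefix="income").astype(int)

# Cross-multiply every pair of one-hot columns -> 3 x 3 = 9 interaction variables.
for d_col in density.columns:
    for i_col in income.columns:
        df[f"{d_col}_x_{i_col}"] = density[d_col] * income[i_col]
```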
The concept of stationarity often appears in time-series analysis.
Stationary variables are those whose statistical properties (mean, variance) DO NOT change over time.

On the flip side, if a variable’s statistical properties change over time, it is called a non-stationary variable.
Preserving stationarity in statistical learning is critical because these models are fundamentally reliant on the assumption that samples are identically distributed.
But if the probability distribution of a variable evolves over time (i.e., it is non-stationary), this assumption is violated.
That is why, typically, using direct values of the non-stationary feature (like the absolute value of the stock price) is not recommended.
Instead, I have always found it better to define features in terms of relative changes:
$$ \frac{\Delta P}{P} \rightarrow \text{relative change in stock price} $$
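In pandas, this is a one-liner; a small sketch with made-up prices:

```python
import pandas as pd

# Hypothetical daily closing prices (non-stationary in absolute terms).
prices = pd.Series([101.2, 102.5, 101.9, 103.4, 104.0], name="close")

# Relative change (delta P / P): a far more stationary feature than the raw price.
returns = prices.pct_change().dropna()
```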
Talking of time series, lagged variables are pretty commonly used in feature engineering and data analytics.
As the name suggests, a lagged variable represents previous time points’ values of a given variable, essentially shifting the data series by a specified number of periods/rows.

For instance, when predicting next month’s sales figures, we might include the sales figures from the previous month as a lagged variable.
Lagged features may include the raw values from one or more previous time steps, as well as rolling aggregates (like a moving average) computed over past values.
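A minimal pandas sketch with a made-up monthly sales series:

```python
import pandas as pd

# Hypothetical monthly sales series.
df = pd.DataFrame({"sales": [120, 135, 128, 150, 162, 170]})

# Previous months' values as lagged features (shift moves the series forward).
df["sales_lag_1"] = df["sales"].shift(1)
df["sales_lag_2"] = df["sales"].shift(2)

# A rolling aggregate of past values (shift first so the window excludes the current row).
df["sales_rolling_mean_3"] = df["sales"].shift(1).rolling(window=3).mean()
```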
Yet again, as the name suggests, leaky variables (unintentionally) provide information about the target variable that would not be available at the time of prediction.
This leads to overly optimistic performance during training, but the model fails to generalize to new data.

I recently talked about leaky variables in this newsletter, in the context of random splitting.
To reiterate, consider a dataset containing medical imaging data.
Each sample consists of multiple images (e.g., different views of the same patient’s body part), and the model is intended to detect the severity of a disease.

In this case, randomly splitting the images into train and test sets will result in data leakage.
This is because images of the same patient will end up in both the training and test sets, allowing the model to “see” information from the same patient during training and testing.
Here’s a paper which committed this mistake (and later corrected it):

To avoid this, each patient’s images must belong to either the train/val set or the test set, never both.
This is called group splitting:
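With scikit-learn, GroupShuffleSplit handles this; a sketch assuming hypothetical image features and patient IDs:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical setup: several images per patient, one patient ID per image.
X = np.random.rand(100, 16)               # image features
y = np.random.randint(0, 2, size=100)     # severity labels
patient_ids = np.random.randint(0, 20, size=100)

# All images of a given patient land on exactly one side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

assert set(patient_ids[train_idx]).isdisjoint(set(patient_ids[test_idx]))
```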

Creating forward-lag features is another way leaky variables get created unintentionally at times:
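For instance, shifting a series backward (a negative shift) pulls future values into the current row; a sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"sales": [120, 135, 128, 150, 162, 170]})

# shift(-1) pulls NEXT month's value into the current row:
df["sales_next"] = df["sales"].shift(-1)   # leaky: unavailable at prediction time

# A safe lagged feature uses a positive shift (past values only):
df["sales_prev"] = df["sales"].shift(1)
```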

That’s it.
From the above discussion, it is pretty clear that there is a whole world of variables beyond features, targets, categorical and numerical variables, etc.
Of course, there are a few more types of variables that I haven’t covered here, as I intend to cover them in another issue.
But till then, can you tell me which ones I have missed?
Thanks for reading!