

TODAY'S ISSUE
11 Types of Variables in a Dataset
In any tabular dataset, we typically categorize each column as either a feature or the target.
However, there are many more types of variables one may find (or define) in a dataset, which I want to discuss today.
These are depicted in the animation below:

Let’s begin!
#1-2) Independent and dependent variables
These are the most common and fundamental to ML.
Independent variables are the inputs used to predict the outcome. They are also referred to as predictors, features, or explanatory variables.

The dependent variable is the outcome that is being predicted. It is also called the target, response, or output variable.
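For instance, here's a minimal sketch of this split in pandas, assuming a hypothetical `housing.csv` file whose `price` column is the target:

```python
import pandas as pd

# Hypothetical dataset: every column except "price" is an
# independent variable (feature); "price" is the dependent variable.
df = pd.read_csv("housing.csv")

X = df.drop(columns=["price"])  # independent variables (predictors)
y = df["price"]                 # dependent variable (target)
```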
#3-4) Confounding and correlated variables
Confounding variables are typically found in a cause-and-effect study (causal inference).

These variables are not of primary interest in the cause-and-effect equation but can potentially lead to spurious associations.
For example, say we want to measure the effect of ice cream sales on the sales of air conditioners.

As you may have guessed, these two measurements are highly correlated.
However, there’s a confounding variable — temperature, which influences both ice cream sales and the sales of air conditioners.

To study the true causal impact, it is essential to consider the confounder (temperature). Otherwise, the study will produce misleading results.
In fact, it is due to the confounding variables that we hear the statement: “Correlation does not imply causation.”
In the above example:
- There is a high correlation between ice cream sales and sales of air conditioners.
- But the sales of air conditioners (effect) are NOT caused by ice cream sales.
Also, in this case, the air conditioner and ice cream sales are correlated variables.
More formally, a change in one variable is associated with a change in another.
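To make this concrete, here's a small simulation (with synthetic numbers, not real sales data) where temperature drives both quantities, and the two sales figures come out strongly correlated even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(42)

# Temperature (the confounder) drives BOTH quantities.
temperature = rng.normal(25, 5, size=1_000)
ice_cream_sales = 10 * temperature + rng.normal(0, 10, size=1_000)
ac_sales = 8 * temperature + rng.normal(0, 10, size=1_000)

# Strong correlation despite no causal link between the two.
print(np.corrcoef(ice_cream_sales, ac_sales)[0, 1])  # close to 1
```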
#5) Control variables
In the above example, to measure the true effect of ice cream sales on air conditioner sales, we must ensure that the temperature remains unchanged throughout the study.

Once controlled, temperature becomes a control variable.
More formally, these are variables that are not the primary focus of the study but are crucial to account for to ensure that the effect we intend to measure is not biased or confounded by other factors.
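Continuing the synthetic example above, one common way to account for a control variable is to include it as a covariate in the regression. Here's a sketch using statsmodels (my choice for illustration, not the only option):

```python
import numpy as np
import statsmodels.api as sm

# Naive regression: AC sales on ice cream sales alone.
naive = sm.OLS(ac_sales, sm.add_constant(ice_cream_sales)).fit()

# Controlled regression: temperature enters as a control variable.
X = sm.add_constant(np.column_stack([ice_cream_sales, temperature]))
controlled = sm.OLS(ac_sales, X).fit()

# The coefficient on ice cream sales shrinks toward zero once
# temperature is held fixed.
print(naive.params, controlled.params)
```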
#6) Latent variables
A latent variable is one that is not directly observed but is inferred from other observed variables.
For instance, we use clustering algorithms because the true labels do not exist, and we want to infer them somehow.

The true label is a latent variable in this case.
Another common example of a latent variable is “intelligence.”
Intelligence itself cannot be directly measured; it is a latent variable.
However, we can infer intelligence through various observable indicators such as test scores, problem-solving abilities, and memory retention.
If you remember, we also learned about latent variables when we studied Gaussian mixture models.
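As a quick refresher, here's a minimal sketch with scikit-learn's GaussianMixture, where the mixture component each point came from is the latent variable the model infers:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two Gaussian blobs; the component labels are never observed.
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    rng.normal(5, 1, size=(200, 2)),
])

# The GMM infers the latent component of each point from the data alone.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
inferred_labels = gmm.predict(X)
```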
#7) Interaction variables
As the name suggests, these variables represent the interaction effect between two or more variables, and are often used in regression analysis.
Here’s an instance I remember using them in.
In a project, I studied the impact of population density and income levels on spending behavior.
- I created three groups for population density — HIGH, MEDIUM, and LOW (one-hot encoded).
- Likewise, I created three groups for income levels — HIGH, MEDIUM, and LOW (one-hot encoded).

To do regression analysis, I created interaction variables by cross-multiplying the two sets of one-hot columns.
This produced 9 interaction variables:
- Population-High and Income-High
- Population-High and Income-Med
- Population-High and Income-Low
- Population-Med and Income-High
- and so on…
Conducting the regression analysis on interaction variables revealed more useful insights than what I observed without them.
To summarize, the core idea is to study two or more variables together rather than independently.
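Here's a minimal sketch of that cross-multiplication in pandas (the column values below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "density": ["HIGH", "LOW", "MEDIUM", "HIGH"],
    "income":  ["LOW", "HIGH", "MEDIUM", "HIGH"],
})

density = pd.get_dummies(df["density"], prefix="density", dtype=int)
income = pd.get_dummies(df["income"], prefix="income", dtype=int)

# Cross-multiply every density column with every income column
# to produce the 9 interaction variables.
for d in density.columns:
    for i in income.columns:
        df[f"{d}_x_{i}"] = density[d] * income[i]
```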
#8-9) Stationary and non-stationary variables
The concept of stationarity often appears in time-series analysis.
Stationary variables are those whose statistical properties (mean, variance) DO NOT change over time.

On the flip side, if a variable’s statistical properties change over time, it is called a non-stationary variable.
Preserving stationarity is critical in statistical learning because most models fundamentally rely on the assumption that samples are identically distributed.
But if a variable’s probability distribution evolves over time (i.e., it is non-stationary), this assumption is violated.
That is why, typically, using direct values of the non-stationary feature (like the absolute value of the stock price) is not recommended.
Instead, I have always found it better to define features in terms of relative changes:
$$ \frac{\Delta P}{P} \rightarrow \text{relative change in stock price} $$
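In pandas, this is a one-liner with `pct_change()`; the prices below are made-up numbers:

```python
import pandas as pd

# Hypothetical price series: absolute values trend upward (non-stationary),
# while the relative changes hover around zero.
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 108.0])

relative_change = prices.pct_change()  # (P_t - P_{t-1}) / P_{t-1}
```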
#10) Lagged variables
Talking of time series, lagged variables are pretty commonly used in feature engineering and data analytics.
As the name suggests, a lagged variable represents previous time points’ values of a given variable, essentially shifting the data series by a specified number of periods/rows.

For instance, when predicting next month’s sales figures, we might include the sales figures from the previous month as a lagged variable.
Lagged features may include:
- 7-day lag on website traffic to predict current website traffic.
- 30-day lag on stock prices to predict the next month’s closing prices.
- And so on…
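In pandas, lagged variables are typically created with `shift()`; the traffic numbers below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"traffic": [120, 135, 128, 150, 160, 155, 170, 180]})

# 7-day lag: each row carries the traffic value from 7 rows earlier.
# The first 7 rows become NaN and must be dropped or imputed.
df["traffic_lag_7"] = df["traffic"].shift(7)
```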
#11) Leaky variables
Yet again, as the name suggests, these variables (unintentionally) provide information about the target variable that would not be available at the time of prediction.
This leads to overly optimistic performance during training, but the model fails to generalize to new data.

I recently discussed leaky variables in this newsletter, in the context of random splitting.
To reiterate, consider a dataset containing medical imaging data.
Each sample consists of multiple images (e.g., different views of the same patient’s body part), and the model is intended to detect the severity of a disease.

In this case, randomly splitting the images into train and test sets will result in data leakage.
This is because images of the same patient will end up in both the training and test sets, allowing the model to “see” information from the same patient during training and testing.
Here’s a paper that committed this mistake (and later corrected it):

To avoid this, each patient’s images must belong entirely to either the train/val set or the test set, never both.
This is called group splitting:

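In scikit-learn, this can be done with GroupShuffleSplit (or GroupKFold), passing patient IDs as the groups. A minimal sketch with toy stand-in data:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Toy stand-ins: 10 patients with 4 images each.
patient_ids = np.repeat(np.arange(10), 4)
X = rng.normal(size=(40, 16))    # image features
y = rng.integers(0, 2, size=40)  # severity labels

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# No patient appears on both sides of the split.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```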
Creating forward-lag features is another way leaky variables can get created unintentionally:

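For instance, a negative shift pulls future values into the current row, information that would never be available at prediction time (made-up numbers below):

```python
import pandas as pd

sales = pd.DataFrame({"sales": [100, 120, 90, 150]})

sales["lag_1"] = sales["sales"].shift(1)    # safe: past information
sales["lead_1"] = sales["sales"].shift(-1)  # leaky: future information
```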
That’s it.
From the above discussion, it is pretty clear that there is a whole world of variables beyond features, targets, categorical and numerical variables, etc.
Of course, there are a few more types of variables that I haven’t covered here, as I intend to cover them in another issue.
But till then, can you tell me which ones I have missed?
Thanks for reading!
IN CASE YOU MISSED IT
Prompting vs. RAG vs. Fine-tuning
ROADMAP
From local ML to production ML
THAT'S A WRAP
No-fluff industry ML resources to succeed in DS/ML roles

At the end of the day, all businesses care about impact. That’s it!
- Can you reduce costs?
- Drive revenue?
- Can you scale ML models?
- Predict trends before they happen?
We have discussed several topics (with implementations) in the past that align with these goals.
Here are some of them:
- Learn sophisticated graph architectures and how to train them on graph data in this crash course.
- So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
- Run large models on small devices using Quantization techniques.
- Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
- Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
- Learn how to scale and implement ML model training in this practical guide.
- Learn 5 techniques with implementation to reliably test ML models in production.
- Learn how to build and implement privacy-first ML systems using Federated Learning.
- Learn 6 techniques with implementation to compress ML models.
All these resources will help you cultivate key skills that businesses and companies care about the most.
SPONSOR US
Advertise to 600k+ data professionals
Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.