Creating Weighted Samples using pandas.DataFrame

Last updated on May 13, 2021

In statistical analysis, using weights to increase or decrease the relative importance of an item in a population is common. In real life, this has much application, particularly when calculating a weighted average. In this post, we will explore the concept and idea behind weights and also how to implement them using a pandas dataframe object.

Weight & Weighting factor

A weight or statistical weight is the amount given to increase or decrease the relative importance of an item in a data set. When it is assigned to a particular data point, it is called the weighting factor of that data point.

A weighting factor is just a fractional number that will affect the probability of that particular item being chosen. The sum of all probabilities is 1. Hence, the sum of all the weights should equal 1.

Weighting function

A weighting function allows us to allocate more weight for some data points compared to the other. This is handy when we know one part of our data set might be more accurate than the other.

For example, if we were to make a sample of the population of our country, we might have an equal probability of choosing a male or a female. But we know for a fact that there is only 49% male while there is 51% female. So if we choose a random sample with an equal probability of selecting either a male or a female, our sample will not represent the real population. For that, we will give 49% weight to the male class and 51% weight to the female class. Now we will be creating a sample that is by rule representative of the original population.

If f(a) is the function that makes the sample of the population and w(a) is the weighting function, then f(a) x w(a) will return the weighted sample.

Creating weighted samples using pandas.DataFrame

It is a simple task to create a weighted sample in pandas. The inbuilt .sample() accepts weights as a parameter that will automatically assign weights to the corresponding item. here is a sample program.

import pandas as pd

weightage = [0.1, 0.1, 0.2, 0.2, 0.4]

students = {'Name':['Tony', 'Jake', 'Sullivan', 'Peter', 'Emma'],
		'Age': [14, 14, 15, 17, 16],
		'Favorite subject': ['Maths', 'General Science', 'Social Studies', 'English Literature', 'Computer Science'],
		'Hobby': ['Writing', 'Gardening', 'Coin collection', 'Reading', 'Programming']}

df = pd.DataFrame(students)
weighted_sample = df.sample(weights=weightage)
print(weighted_sample)

Output:

   Name  Age  Favorite subject        Hobby
4  Emma   16  Computer Science  Programming

By default, .sample() will only output one row. We can see that 40% weightage was given to the to the last row and it was selected. Now will try increasing the number of rows.

import pandas as pd

weightage = [0.1, 0.1, 0.2, 0.2, 0.4]

students = {'Name':['Tony', 'Jake', 'Sullivan', 'Peter', 'Emma'],
		'Age': [14, 14, 15, 17, 16],
		'Favorite subject': ['Maths', 'General Science', 'Social Studies', 'English Literature', 'Computer Science'],
		'Hobby': ['Writing', 'Gardening', 'Coin collection', 'Reading', 'Programming']}

df = pd.DataFrame(students)
weighted_sample = df.sample(n=3, weights=weightage)
print(weighted_sample)

Output:

       Name  Age    Favorite subject            Hobby
4      Emma   16    Computer Science      Programming
2  Sullivan   15      Social Studies  Coin collection
3     Peter   17  English Literature          Reading

The above output is not a given. Since these three values had 40%, 20%, and 20% chance, they were more likely to be selected over the first two values. But it’s only a matter of probability. We can run the program multiple times and get opposing results.

Using column as weights

The above example might be useful for a short set of data where we can define weights of each item corresponding to their index. But normally We may be required to have weight over some particular feature of our data. In the below example, we use age to create our sample.

import pandas as pd

students = {'Name':['Tony', 'Jake', 'Sullivan', 'Peter', 'Emma'],
		'Age': [14, 14, 15, 17, 16],
		'Favorite subject': ['Maths', 'General Science', 'Social Studies', 'English Literature', 'Computer Science'],
		'Hobby': ['Writing', 'Gardening', 'Coin collection', 'Reading', 'Programming']}

df = pd.DataFrame(students)
weighted_sample = df.sample(n=3, weights="Age")
print(weighted_sample)

In this case, students with higher age will be more likely to be sampled than the other.

Drawback

If there is no weight, all items have an equal probability of being selected. But if the weights are present, the probability is changed. Weighting is done by repeating the number of rows. Rows with more weight will be repeated more than the ones with less weight. That comes with a major drawback.

Rows can only be repeated a definite number of times. Probabilities are fractional. So, the weights we give are rounded to a proper integer that describes the number of times the row is repeated. This will create some error where probability can be greater or smaller than the given weight depending on how it is rounded.

Written by Aravind Sanjeev, an India-based blogger and web developer. Read all his posts. You can also find him on twitter.