Introduction

We often deal with large amounts of data, so it is usually practical to draw a sample before making inferences. One of the most widespread methods for creating samples is simple random sampling. Today, we will discuss ways to do that in Python. We will create random samples from Python sequences and also from a pandas DataFrame, which is handy for data science.

Objectives

  • Learn how to sample data from built-in Python types such as list, tuple, string, and set.
  • Learn how to sample data from a pandas DataFrame.

What is a random sample?

A random sample is just what it sounds like: a subset drawn from our data set in which every item has the same chance of being selected. It is one of the simplest and most widely used ways to sample a data set.

Why use a random sample?

Generating a random sample is easy in practice, and it also helps us avoid selection bias. When we sample a data set, we want the sample to be representative of the whole data set. A simple random sample is one of the easiest ways to achieve that. It is also easy to draw another random sample from the same data set for comparison.
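To make the comparison idea concrete, here is a minimal sketch that draws two independent samples from the same population and compares their means with the population mean. The population (the integers 1 to 1000), the sample size of 50, and the seed are assumptions chosen only for illustration.

```python
import random
import statistics

# Hypothetical population: the integers 1 to 1000.
population = list(range(1, 1001))

# A dedicated generator with a fixed seed, so the run is repeatable.
rng = random.Random(42)

# Draw two independent simple random samples of the same size.
sample_a = rng.sample(population, 50)
sample_b = rng.sample(population, 50)

# Compare each sample's mean with the population mean.
population_mean = statistics.mean(population)
print("population mean:", population_mean)
print("sample A mean:", statistics.mean(sample_a))
print("sample B mean:", statistics.mean(sample_b))
```

The two sample means will usually differ a little from each other and from the population mean; how close they land depends on the sample size.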

Creating random samples from list, tuple, string, or set

In Python, a list, a tuple, and a string are all sequences, which means we can use an index to access their values. A set can also store multiple values, but it has no defined order and cannot hold duplicates. The random.sample() function works directly on the three sequence types; on recent Python versions a set has to be converted to a sequence first.

Creating random sample from a list of numbers

STEP 1: Importing random library

import random

STEP 2: Creating a list

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

STEP 3: Generating a random number

random_number = random.sample(numbers, 1)
print(random_number)

Output:

[2]

We can change the second argument of random.sample() to the sample size we want. With 2, it returns 2 random numbers from the list, as shown below.

import random

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

random_number = random.sample(numbers, 2)

print(random_number)

Output:

[10, 8]
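The output changes on every run. If a repeatable result is needed, for example in a test, one option is to seed the generator first. The sketch below assumes an arbitrary seed value of 7 and also shows that random.sample() samples without replacement, so no position in the list is picked twice.

```python
import random

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Seeding the generator makes the "random" choice repeatable.
random.seed(7)
first_run = random.sample(numbers, 3)

random.seed(7)
second_run = random.sample(numbers, 3)

print(first_run == second_run)  # True: same seed, same sample

# Sampling is without replacement: no position is picked twice.
print(len(set(first_run)) == len(first_run))  # True for a list of unique values
```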

We can replace the list with a tuple and the steps stay the same. A set has no order, so we convert it to a sequence first, here with sorted().

import random

# declaring list, tuple, and set
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
numbers_tuple = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
numbers_set = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

# selecting a random number
random_number = random.sample(numbers, 1)
random_number_tuple = random.sample(numbers_tuple, 1)
# a set must be converted to a sequence before sampling (required from Python 3.11)
random_number_set = random.sample(sorted(numbers_set), 1)

# printing the output
print(random_number)
print(random_number_tuple)
print(random_number_set)

Output:

[3]

[8]

[7]

Please note: Passing a set directly to random.sample() has been deprecated since Python 3.9, and from Python 3.11 onward it raises a TypeError. Converting the set to a sequence first, for example with sorted(numbers_set), works on every version.
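A minimal sketch of the version-independent approach: convert the set to a sequence before sampling. Using sorted() rather than list() is a choice made here so that the population order is stable from run to run.

```python
import random

numbers_set = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

# Convert the set to a sequence first; sorted() also gives a stable order.
population = sorted(numbers_set)
random_numbers = random.sample(population, 3)
print(random_numbers)
```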

Creating random sample from a string

The step by step process is pretty much the same. The code will look as follows.

import random
name = "ABRAHAM LINCOLN"
random_letter = random.sample(name, 1)
print(random_letter)

Output:

['H']
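Note that the result is a list of single-character strings, not a string. If a string is needed, the characters can be joined back together, as in this small sketch (the sample size of 5 is an arbitrary choice).

```python
import random

name = "ABRAHAM LINCOLN"

# random.sample() returns a list of single-character strings.
letters = random.sample(name, 5)
print(letters)

# Join the characters if a string is needed instead of a list.
joined = "".join(letters)
print(joined)
```

Because sampling works by position, a letter that occurs several times in the string (such as "A") can appear more than once in the result.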

Handling exceptions

If we pass a sample size that is greater than the size of the sequence (or a negative number), random.sample() raises a ValueError. For example:

import random
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
random_number = random.sample(numbers, 11)
print(random_number)

ValueError: Sample larger than population or is negative

Remember we are creating samples in order to understand a larger population. So it makes no sense to have a sample size greater than the population itself. This sounds like a silly mistake, but it can cause our program to crash. We can use try/except to avoid it.

import random

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

try:
    random_number = random.sample(numbers, 11)
    print(random_number)
except ValueError:
    # random.sample() raises ValueError for an oversized or negative sample size
    print("The given sample size is bigger than the population or the number is negative")

Output:

The given sample size is bigger than the population or the number is negative
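An alternative to catching the exception is to validate the sample size before sampling. The sketch below uses a hypothetical helper, safe_sample(), which is not part of the random module; it caps the requested size at the population size and still rejects negative values. Whether silently capping the size is acceptable depends on the program.

```python
import random

def safe_sample(population, k):
    """Hypothetical helper: return up to k items, never more than the population holds."""
    if k < 0:
        raise ValueError("Sample size must not be negative")
    # Clamp k so that random.sample() never sees an oversized request.
    return random.sample(population, min(k, len(population)))

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(len(safe_sample(numbers, 11)))  # 10: the request was clamped to the population size
print(len(safe_sample(numbers, 3)))   # 3
```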

Creating random samples from Pandas Dataframe

If we are working on a data science project, chances are we will need to create a random sample from a pandas DataFrame. Like Python's standard library, pandas makes this simple with a built-in sample() method.

STEP 1: Open the command prompt and install pandas

pip install pandas

STEP 2: Importing pandas

import pandas as pd

STEP 3: Creating a dictionary for building our dataframe

students = {'Name': ['Tony', 'Jake', 'Sullivan', 'Peter', 'Emma'],
            'Age': [14, 14, 15, 17, 16],
            'Favorite subject': ['Maths', 'General Science', 'Social Studies', 'English Literature', 'Computer Science'],
            'Hobby': ['Writing', 'Gardening', 'Coin collection', 'Reading', 'Programming']}

STEP 4: Converting to dataframe

df = pd.DataFrame(students)

print(df)

Output:

       Name  Age    Favorite subject            Hobby
0      Tony   14               Maths          Writing
1      Jake   14     General Science        Gardening
2  Sullivan   15      Social Studies  Coin collection
3     Peter   17  English Literature          Reading
4      Emma   16    Computer Science      Programming

Now our dataframe is ready. All we need to do is pick a random sample, which is surprisingly easy.

STEP 5: Creating random sample

random_sample = df.sample()
print(random_sample)

Output:

       Name  Age Favorite subject            Hobby
2  Sullivan   15   Social Studies  Coin collection

Calling .sample() with no arguments picks a single random row. To get a fixed number of rows, pass n as follows.

random_sample = df.sample(n=3)
print(random_sample)

Output:

       Name  Age    Favorite subject            Hobby
3     Peter   17  English Literature          Reading
0      Tony   14               Maths          Writing
2  Sullivan   15      Social Studies  Coin collection

Here it outputs 3 rows. We can also express the sample size as a fraction of the data. In the example below, we ask for 30% of the rows; with 5 rows that is 1.5, which pandas rounds to 2 rows here.

random_sample = df.sample(frac=0.3)
print(random_sample)

Output:

    Name  Age    Favorite subject      Hobby
3  Peter   17  English Literature    Reading
1   Jake   14     General Science  Gardening
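Two details of DataFrame.sample() are worth a quick sketch: the random_state parameter makes a sample repeatable, and asking for more rows than the dataframe holds raises a ValueError unless replace=True is passed. The shortened dataframe and the seed value of 1 below are assumptions for illustration.

```python
import pandas as pd

students = {'Name': ['Tony', 'Jake', 'Sullivan', 'Peter', 'Emma'],
            'Age': [14, 14, 15, 17, 16]}
df = pd.DataFrame(students)

# The same random_state yields the same rows on every call.
sample_one = df.sample(n=3, random_state=1)
sample_two = df.sample(n=3, random_state=1)
print(sample_one.equals(sample_two))  # True

# Asking for more rows than exist raises ValueError unless replace=True.
try:
    df.sample(n=6)
except ValueError as error:
    print("Could not sample:", error)
```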

PS: If we are dealing with a .csv file, we can use pd.read_csv(), which loads it directly into a dataframe. Then we can follow the same steps as above.
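As a rough sketch of that workflow, the example below reads CSV data and samples two rows. To keep it self-contained, the CSV content is an in-memory string standing in for a real file; with an actual file we would pass its path to pd.read_csv() instead.

```python
import io
import pandas as pd

# Stand-in for a file on disk; with a real file, pass its path to pd.read_csv().
csv_text = """Name,Age
Tony,14
Jake,14
Sullivan,15
Peter,17
Emma,16
"""

df = pd.read_csv(io.StringIO(csv_text))
random_sample = df.sample(n=2)
print(random_sample)
```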

Conclusion

In this post, we discussed what simple random sampling is and how it is useful in the data science world. We also saw how to create a simple random sample from Python sequences and from a pandas DataFrame, and how to handle the exception raised by an invalid sample size.

Written by Aravind Sanjeev, an India-based blogger and web developer.