Full name: Thien Tran Link to github page of the project
This project is a step-by-step walkthrough in order to explore the US cancer statistics from 1999 to 2018. The dataset was provided by Centers for Disease Control and Prevention (CDC) and National Cancer Institute (NCI) as a part of Wide-ranging ONline Data for Epidemiologic Research (WONDER) project. The dataset is publicly avaiable at United States Cancer Statistics.
Cancer incidence and mortality data are available for the United States, state and metropolitan areas (MSA) by age group, race, sex, year of diagnosis, and leading cancer site for the years 1999 - 2018.
The ultimate goal of this project is to find out if there is any relationship among those aforementioned variables. Also, some visualization tools will be deployed to give insights to the statistical data as well as explore some (potentially) interesting characteristics. And ultimately, I aspire to answer the question: Can we predict how likely a cancer case is mortal?
Tools Some basic libraries and Python built-in methods will be used to process the data. Matplotlib is expected to be the primary visualization tool. This tutorial is made by one person, therefore, no collaboration plan will be made.
The dataset was downloaded from CDC website. Data documentation can also be found there. There are 4 seperate txt files: 2 for cancer incidences 1999-2008 and 2009-2018; 2 for cancer mortalities 1999-2008 and 2009-2018. The dataset was not undergone any data cleaning process.
Before loading, some necessary libraries will be called in the cell below:
#Import necessary libraries
import numpy as np
import pandas as pd
#Ask pandas to show all columns
pd.options.display.max_rows = 10000
pd.options.display.width = 1000
Now, let's load the 4 txt files using pd.read_csv
and have a look at them: (note that in the originals files, each data is separated by tab)
Incidence_1999_2008 = pd.read_csv('./US_cancer_1999_2008_Incidence.txt', sep="\t")
Incidence_2009_2018 = pd.read_csv('./US_cancer_2009_2018_Incidence.txt', sep="\t")
Mortality_1999_2008 = pd.read_csv('./US_cancer_1999_2008_Mortality.txt', sep="\t")
Mortality_2009_2018 = pd.read_csv('./US_cancer_2009_2018_Mortality.txt', sep="\t")
display(Incidence_1999_2008)
display(Incidence_2009_2018)
display(Mortality_1999_2008)
display(Mortality_2009_2018)
Notes | Leading Cancer Sites | Leading Cancer Sites Code | Race | Race Code | Year | Year Code | Sex | Sex Code | Age Groups | Age Groups Code | Count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | < 1 year | 1 | Suppressed |
1 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 1-4 years | 1-4 | Suppressed |
2 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 5-9 years | 5-9 | Suppressed |
3 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 10-14 years | 10-14 | Suppressed |
4 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 15-19 years | 15-19 | Suppressed |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
41795 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2008 | 2008 | Male | M | 65-69 years | 65-69 | 68 |
41796 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2008 | 2008 | Male | M | 70-74 years | 70-74 | 63 |
41797 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2008 | 2008 | Male | M | 75-79 years | 75-79 | 50 |
41798 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2008 | 2008 | Male | M | 80-84 years | 80-84 | 32 |
41799 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2008 | 2008 | Male | M | 85+ years | 85+ | 25 |
41800 rows × 12 columns
Notes | Leading Cancer Sites | Leading Cancer Sites Code | Race | Race Code | Year | Year Code | Sex | Sex Code | Age Groups | Age Groups Code | Count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 2009 | 2009 | Female | F | < 1 year | 1 | Suppressed |
1 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 2009 | 2009 | Female | F | 1-4 years | 1-4 | Suppressed |
2 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 2009 | 2009 | Female | F | 5-9 years | 5-9 | Suppressed |
3 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 2009 | 2009 | Female | F | 10-14 years | 10-14 | Suppressed |
4 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 2009 | 2009 | Female | F | 15-19 years | 15-19 | Suppressed |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
41795 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 65-69 years | 65-69 | 177 |
41796 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 70-74 years | 70-74 | 222 |
41797 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 75-79 years | 75-79 | 167 |
41798 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 80-84 years | 80-84 | 111 |
41799 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 85+ years | 85+ | 89 |
41800 rows × 12 columns
Notes | Leading Cancer Sites | Leading Cancer Sites Code | Race | Race Code | Year | Year Code | Sex | Sex Code | Age Group | Age Group Code | Deaths | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | < 1 year | 1 | 0 |
1 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 1-4 years | 1-4 | Suppressed |
2 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 5-9 years | 5-9 | Suppressed |
3 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 10-14 years | 10-14 | 0 |
4 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 15-19 years | 15-19 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33435 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2008 | 2008 | Male | M | 65-69 years | 65-69 | 888 |
33436 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2008 | 2008 | Male | M | 70-74 years | 70-74 | 1085 |
33437 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2008 | 2008 | Male | M | 75-79 years | 75-79 | 1478 |
33438 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2008 | 2008 | Male | M | 80-84 years | 80-84 | 1781 |
33439 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2008 | 2008 | Male | M | 85+ years | 85+ | 2345 |
33440 rows × 12 columns
Notes | Leading Cancer Sites | Leading Cancer Sites Code | Race | Race Code | Year | Year Code | Sex | Sex Code | Age Group | Age Group Code | Deaths | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 2009 | 2009 | Female | F | < 1 year | 1 | 0 |
1 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 2009 | 2009 | Female | F | 1-4 years | 1-4 | 0 |
2 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 2009 | 2009 | Female | F | 5-9 years | 5-9 | Suppressed |
3 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 2009 | 2009 | Female | F | 10-14 years | 10-14 | Suppressed |
4 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 2009 | 2009 | Female | F | 15-19 years | 15-19 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33435 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2018 | 2018 | Male | M | 65-69 years | 65-69 | 1082 |
33436 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2018 | 2018 | Male | M | 70-74 years | 70-74 | 1456 |
33437 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2018 | 2018 | Male | M | 75-79 years | 75-79 | 1677 |
33438 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2018 | 2018 | Male | M | 80-84 years | 80-84 | 1890 |
33439 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2018 | 2018 | Male | M | 85+ years | 85+ | 3335 |
33440 rows × 12 columns
Apparently, the datasets we just imported are not clean. Firstly, there are some redundant (Year code, Age group, Sex, Notes) and not-so-informative (Leading Cancer Sites Code, Race Code, Notes) columns. Secondly, it is much better to work on one table only instead of four mostly identical tables.
Hence, for the first step, let's concatenate 2 incidence datasets and 2 mortality datasets using pd.concat()
, the ignore_index=True
allows us to make the index continous:
Incidence = pd.concat([Incidence_1999_2008,Incidence_2009_2018], ignore_index=True)
Mortality = pd.concat([Mortality_1999_2008,Mortality_2009_2018], ignore_index=True)
display(Incidence)
display(Mortality)
Notes | Leading Cancer Sites | Leading Cancer Sites Code | Race | Race Code | Year | Year Code | Sex | Sex Code | Age Groups | Age Groups Code | Count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | < 1 year | 1 | Suppressed |
1 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 1-4 years | 1-4 | Suppressed |
2 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 5-9 years | 5-9 | Suppressed |
3 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 10-14 years | 10-14 | Suppressed |
4 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 15-19 years | 15-19 | Suppressed |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
83595 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 65-69 years | 65-69 | 177 |
83596 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 70-74 years | 70-74 | 222 |
83597 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 75-79 years | 75-79 | 167 |
83598 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 80-84 years | 80-84 | 111 |
83599 | NaN | Urinary Bladder, invasive and in situ | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 85+ years | 85+ | 89 |
83600 rows × 12 columns
Notes | Leading Cancer Sites | Leading Cancer Sites Code | Race | Race Code | Year | Year Code | Sex | Sex Code | Age Group | Age Group Code | Deaths | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | < 1 year | 1 | 0 |
1 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 1-4 years | 1-4 | Suppressed |
2 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 5-9 years | 5-9 | Suppressed |
3 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 10-14 years | 10-14 | 0 |
4 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 15-19 years | 15-19 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
66875 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2018 | 2018 | Male | M | 65-69 years | 65-69 | 1082 |
66876 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2018 | 2018 | Male | M | 70-74 years | 70-74 | 1456 |
66877 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2018 | 2018 | Male | M | 75-79 years | 75-79 | 1677 |
66878 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2018 | 2018 | Male | M | 80-84 years | 80-84 | 1890 |
66879 | NaN | Urinary Bladder | 29010 | White | 2106-3 | 2018 | 2018 | Male | M | 85+ years | 85+ | 3335 |
66880 rows × 12 columns
Great! Now let's look at the data types of each table and then we can decide what to do next.
Incidence.dtypes
Notes float64 Leading Cancer Sites object Leading Cancer Sites Code object Race object Race Code object Year int64 Year Code int64 Sex object Sex Code object Age Groups object Age Groups Code object Count object dtype: object
Mortality.dtypes
Notes float64 Leading Cancer Sites object Leading Cancer Sites Code object Race object Race Code object Year int64 Year Code int64 Sex object Sex Code object Age Group object Age Group Code object Deaths object dtype: object
The Notes columns seem to be strange, let's have look at that.
Incidence["Notes"].unique()
array([nan])
Mortality["Notes"].unique()
array([nan])
They are basically empty. So we can remove that column later. Leading Cancer Sites Code looks strange at the last indices, let's have a closer look at them:
Incidence["Leading Cancer Sites"].unique()
array(['Brain and Other Nervous System', 'Breast', 'Cervix Uteri', 'Colon and Rectum', 'Corpus Uteri', 'Esophagus', 'Gallbladder', 'Kidney and Renal Pelvis', 'Larynx', 'Leukemias', 'Liver', 'Lung and Bronchus', 'Melanoma of the Skin', 'Myeloma', 'Non-Hodgkin Lymphoma', 'Oral Cavity and Pharynx', 'Ovary', 'Pancreas', 'Prostate', 'Stomach', 'Thyroid', 'Urinary Bladder, invasive and in situ'], dtype=object)
Mortality["Leading Cancer Sites"].unique()
array(['Brain and Other Nervous System', 'Breast', 'Cervix Uteri', 'Colon and Rectum', 'Corpus Uteri', 'Esophagus', 'Gallbladder', 'Kidney and Renal Pelvis', 'Larynx', 'Leukemias', 'Liver', 'Lung and Bronchus', 'Melanoma of the Skin', 'Myeloma', 'Non-Hodgkin Lymphoma', 'Oral Cavity and Pharynx', 'Ovary', 'Pancreas', 'Prostate', 'Stomach', 'Thyroid', 'Urinary Bladder'], dtype=object)
The last type has different name so just rename it and the we can join two table together.
Incidence = Incidence.replace('Urinary Bladder, invasive and in situ', 'Urinary Bladder')
Incidence
Notes | Leading Cancer Sites | Leading Cancer Sites Code | Race | Race Code | Year | Year Code | Sex | Sex Code | Age Groups | Age Groups Code | Count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | < 1 year | 1 | Suppressed |
1 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 1-4 years | 1-4 | Suppressed |
2 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 5-9 years | 5-9 | Suppressed |
3 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 10-14 years | 10-14 | Suppressed |
4 | NaN | Brain and Other Nervous System | 31010-31040 | American Indian or Alaska Native | 1002-5 | 1999 | 1999 | Female | F | 15-19 years | 15-19 | Suppressed |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
83595 | NaN | Urinary Bladder | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 65-69 years | 65-69 | 177 |
83596 | NaN | Urinary Bladder | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 70-74 years | 70-74 | 222 |
83597 | NaN | Urinary Bladder | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 75-79 years | 75-79 | 167 |
83598 | NaN | Urinary Bladder | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 80-84 years | 80-84 | 111 |
83599 | NaN | Urinary Bladder | 29010 | Other Races and Unknown combined | 2131-1 | 2018 | 2018 | Male | M | 85+ years | 85+ | 89 |
83600 rows × 12 columns
Let's manicure the two tables a bit and join them!
Incidence.drop(columns=["Notes", "Leading Cancer Sites Code", "Race Code", "Year Code", "Sex", "Age Groups"], inplace=True)
Mortality.drop(columns=["Notes", "Leading Cancer Sites Code", "Race Code", "Year Code", "Sex", "Age Group"], inplace=True)
Incidence.rename(columns={"Age Groups Code": "Age Group"}, inplace=True)
Mortality.rename(columns={"Age Group Code": "Age Group"}, inplace=True)
df = Incidence.merge(Mortality, on=["Leading Cancer Sites", "Race", "Year", "Sex Code", "Age Group"], how="outer")
df.replace({'1': 0, '1-4': 1, '5-9': 5, '10-14': 10, '15-19': 15, '20-24': 20, '25-29': 25, '30-34': 30,
'35-39': 35, '40-44': 40, '45-49': 45, '50-54': 50, '55-59': 55, '60-64': 60, '65-69': 65,
'70-74': 70, '75-79': 75, '80-84': 80, '85+': 85}, inplace=True)
"Suppressed" values mean the number of cases are less then 16 but greater than 1. They are labelled as "Supressed" to protect the privacy of the minorities. Approximately, let's assume Suppressed = 8.
df = df.replace({'Suppressed': 8})
df["Count"] = pd.to_numeric(df['Count'], errors="coerce")
df["Deaths"] = pd.to_numeric(df['Deaths'], errors="coerce")
df
Leading Cancer Sites | Race | Year | Sex Code | Age Group | Count | Deaths | |
---|---|---|---|---|---|---|---|
0 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 0 | 8.0 | 0.0 |
1 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 1 | 8.0 | 8.0 |
2 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 5 | 8.0 | 8.0 |
3 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 10 | 8.0 | 0.0 |
4 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 15 | 8.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
83595 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 65 | 177.0 | NaN |
83596 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 70 | 222.0 | NaN |
83597 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 75 | 167.0 | NaN |
83598 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 80 | 111.0 | NaN |
83599 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 85 | 89.0 | NaN |
83600 rows × 7 columns
Looks good! That's just the basic cleaning. Now let's have some insights to our data. What type of cancer is the most popular?
df.groupby(["Leading Cancer Sites"])["Count"].sum().idxmax()
'Breast'
What type of cancer causes highest deaths?
df.groupby(["Leading Cancer Sites"])["Deaths"].sum().idxmax()
'Lung and Bronchus'
Male or female, who are more likely to contract pancreas cancer?
df.groupby(["Leading Cancer Sites", "Sex Code"])["Count"].sum()
#Same likelihood
Leading Cancer Sites Sex Code Brain and Other Nervous System F 193517.0 M 237384.0 Breast F 4399782.0 M 40018.0 Cervix Uteri F 258451.0 M 0.0 Colon and Rectum F 1418504.0 M 1512601.0 Corpus Uteri F 883689.0 M 0.0 Esophagus F 72355.0 M 251454.0 Gallbladder F 53309.0 M 25440.0 Kidney and Renal Pelvis F 390202.0 M 645503.0 Larynx F 54143.0 M 201049.0 Leukemias F 380629.0 M 515759.0 Liver F 113039.0 M 315737.0 Lung and Bronchus F 1959879.0 M 2280589.0 Melanoma of the Skin F 543773.0 M 756420.0 Myeloma F 193554.0 M 234840.0 Non-Hodgkin Lymphoma F 584326.0 M 686504.0 Oral Cavity and Pharynx F 228325.0 M 534189.0 Ovary F 431176.0 M 0.0 Pancreas F 400669.0 M 410182.0 Prostate F 0.0 M 4182788.0 Stomach F 173306.0 M 274333.0 Thyroid F 563402.0 M 187842.0 Urinary Bladder F 340095.0 M 1038584.0 Name: Count, dtype: float64
Let's visualize it! First is the incidences and deaths sorted by type of cancer
df.groupby(["Leading Cancer Sites"])["Count"].sum().plot.bar().set_title("Cancer cases by sites")
Text(0.5, 1.0, 'Cancer cases by sites')
df.groupby(["Leading Cancer Sites"])["Deaths"].sum().plot.bar().set_title("Cancer deaths by sites")
Text(0.5, 1.0, 'Cancer deaths by sites')
Look how dangerous Lung and Bronchus cancer is!
What is the trend in deaths from cancer among white community over 20 years?
df[df["Race"] == "White"].groupby(["Year"])["Deaths"].sum().plot.line().set_title("Cancer deaths among white community over year ")
Text(0.5, 1.0, 'Cancer deaths among white community over year ')
It seems like around 2015, new therapies were introduce and resulted in the decreasing of deaths. And more information will be revealed in the futute! Stay tuned!
In order to have deeper insight into our dataset, we may have to introduce a new and independent set. Now we load "Total income by race" dataset fetched from United States Census Bureau. The original data was income of American household from 1967-2021, apparently we need the data from 1999 to 2018 only. Let's import and clean the data.
income_all = pd.read_csv("./income_1967_2021.csv",)
#Slicing 1999-2018 only
income_df = income_all.loc[32:51]
#Convert strings to numeric
income_df_clean = income_df[["All races", "White, non-Hispanic", "Black", "Hispanic (any race)", "Asian"]].apply(lambda x: x.str.replace(',', ''), axis=1).apply(pd.to_numeric)
income_df_clean["Year"] = income_df["Year"]
#income_df_clean = income_df_clean[["Year", "All races", "White", "Black or African American", "Other Races and Unknown combined", "Asian or Pacific Islander"]]
income_df_clean = income_df_clean[["Year","All races", "White, non-Hispanic", "Black", "Hispanic (any race)", "Asian"]]
income_df_clean.set_index("Year")
income_df_clean.columns = ["Year", "All races", "White", "Black or African American", "Other Races and Unknown combined", "Asian or Pacific Islander"]
income_df_clean
Year | All races | White | Black or African American | Other Races and Unknown combined | Asian or Pacific Islander | |
---|---|---|---|---|---|---|
32 | 1999 | 63423 | 68817 | 43497 | 47916 | 79419 |
33 | 2000 | 63292 | 68768 | 44718 | 49995 | 84043 |
34 | 2001 | 61889 | 67864 | 43191 | 49193 | 78607 |
35 | 2002 | 61190 | 67669 | 42098 | 47763 | 75439 |
36 | 2003 | 61113 | 67404 | 41885 | 46552 | 77964 |
37 | 2004 | 60901 | 67187 | 41534 | 47078 | 78917 |
38 | 2005 | 61553 | 67476 | 41128 | 47789 | 81114 |
39 | 2006 | 62033 | 67467 | 41353 | 48623 | 82237 |
40 | 2007 | 62865 | 68731 | 42664 | 48406 | 82442 |
41 | 2008 | 60624 | 66924 | 41392 | 45692 | 79020 |
42 | 2009 | 60200 | 65865 | 39608 | 46004 | 78699 |
43 | 2010 | 58627 | 64794 | 38258 | 44772 | 75582 |
44 | 2011 | 57732 | 63912 | 37331 | 44549 | 74965 |
45 | 2012 | 57623 | 64391 | 38084 | 44055 | 77010 |
46 | 2013 | 57808 | 64854 | 38704 | 45592 | 80661 |
47 | 2014 | 58725 | 67146 | 39021 | 44171 | 81897 |
48 | 2015 | 61748 | 65948 | 40646 | 46505 | 83867 |
49 | 2016 | 63683 | 68778 | 43217 | 49328 | 87180 |
50 | 2017 | 64806 | 70157 | 42865 | 51425 | 85491 |
51 | 2018 | 65127 | 72005 | 42977 | 52974 | 89491 |
df["Race"].unique()
array(['American Indian or Alaska Native', 'Asian or Pacific Islander', 'Black or African American', 'White', 'Other Races and Unknown combined'], dtype=object)
We are missing "American Indian or Alaska Native" in our income dataset, so we have to "make up" some data for this based on some facts I gathered around. Based on American Community Survey, Native Americans's median income in 2005-2009, 2015-2019 are \$43,622 and \$43,825, respectively.
df
Leading Cancer Sites | Race | Year | Sex Code | Age Group | Count | Deaths | |
---|---|---|---|---|---|---|---|
0 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 0 | 8.0 | 0.0 |
1 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 1 | 8.0 | 8.0 |
2 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 5 | 8.0 | 8.0 |
3 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 10 | 8.0 | 0.0 |
4 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 15 | 8.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
83595 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 65 | 177.0 | NaN |
83596 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 70 | 222.0 | NaN |
83597 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 75 | 167.0 | NaN |
83598 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 80 | 111.0 | NaN |
83599 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 85 | 89.0 | NaN |
83600 rows × 7 columns
Let's have a look at some basic summary of out datasets!
import matplotlib.pyplot as plt
from matplotlib.ticker import (MultipleLocator,
FormatStrFormatter,
AutoMinorLocator)
races = ['American Indian or Alaska Native', 'Asian or Pacific Islander',
'Black or African American', 'White',
'Other Races and Unknown combined']
fig, axes = plt.subplots(1, 3, figsize=(21, 4))
axes[0].set_title("Total cancer counts by race in the US 1999-2018")
axes[1].set_title("Total cancer deaths by race in the US 1999-2018")
axes[2].set_title("Cancer death rate by race in the US 1999-2018")
axes[0].xaxis.set_major_locator(MultipleLocator(2))
axes[1].xaxis.set_major_locator(MultipleLocator(2))
axes[2].xaxis.set_major_locator(MultipleLocator(2))
for race in races:
df[df["Race"] == race].groupby(["Year"])["Count"].sum().plot.line(ax = axes[0], label = race, legend=True)
df[df["Race"] == race].groupby(["Year"])["Deaths"].sum().plot.line(ax = axes[1], label = race, legend=True)
(df[df["Race"] == race].groupby(["Year"])["Deaths"].sum()/df[df["Race"] == race].groupby(["Year"])["Count"].sum()).plot.line(ax = axes[2], label = race, legend=True).set_ylabel("Death rate")
One conclusion: White people have more cancer cases, but Native and African American have higher death rate. But overall, the death rate decreases over time for all races.
Let's look and their income statistics.
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
ax.set_title("Median income by race in the US 1999-2018")
ax.set_ylabel("Annual income in dollars")
income_df_clean.plot.line(x="Year",ax = ax, label=race, legend=True).xaxis.set_major_locator(MultipleLocator(2))
The death rate follows the ascending trend: Asian - White - Black - Native, besides, the income follows the trend: Asian - White - Black - Native (decreasing in income). Is it a coincidence? The more income, the lower cancer death rate. It seems like cancer therapies have been costing people a lot of money, the more income they have, the less likely those cancer cases are mortal. It's time to embark on something specific. This time we only focus on "White American" data.
white = df[df["Race"]=="White"]
fig, axes = plt.subplots(1, 3, figsize=(21, 4))
axes[0].set_title("Total cancer counts by age group among white community")
axes[1].set_title("Total cancer deaths by age group among white community")
axes[2].set_title("Cancer death rate by age group among white community")
white.groupby("Age Group")["Count"].sum().plot(ax=axes[0])
white.groupby("Age Group")["Deaths"].sum().plot(ax=axes[1])
(white.groupby("Age Group")["Deaths"].sum()/white.groupby("Age Group")["Count"].sum()).plot(ax=axes[2])
<AxesSubplot:title={'center':'Cancer death rate by age group among white community'}, xlabel='Age Group'>
Among white community, from 25 years old and above, the likelihood of mortality of cancer cases increases monotinically. Moreover, mortality of children who have cancer is pretty high (20%).
Let's do the same thing with black community. As shown above, we can say that among the black community's data are pretty much the same as the white's, except for the apparently higher child mortality.
black = df[df["Race"]=='Black or African American']
fig, axes = plt.subplots(1, 3, figsize=(21, 4))
axes[0].set_title("Total cancer counts by age group among black community")
axes[1].set_title("Total cancer deaths by age group among black community")
axes[2].set_title("Cancer death rate by age group among black community")
black.groupby("Age Group")["Count"].sum().plot(ax=axes[0])
black.groupby("Age Group")["Deaths"].sum().plot(ax=axes[1])
(black.groupby("Age Group")["Deaths"].sum()/black.groupby("Age Group")["Count"].sum()).plot(ax=axes[2])
<AxesSubplot:title={'center':'Cancer death rate by age group among black community'}, xlabel='Age Group'>
(df.groupby(["Leading Cancer Sites"])["Deaths"].sum()/df.groupby(["Leading Cancer Sites"])["Count"].sum()).plot.bar().set_title("Death rate among cancer sites")
Text(0.5, 1.0, 'Death rate among cancer sites')
Esophagus and pancreas are the deadliest cancer sites!
Come back to the questions above that we would like to find an answer: Can we predict how likely a cancer case is mortal? Take a look at out dataset, so far, we have known that cancer counts and deaths are linked to sex (breast and prostate are extreme case), race, age agroup, cancer sites. We have 4 variables, and the label is the death rate. Since the labels are ranging from 0 to 1, this is more like a regression problem, when you enter 4 variables as inputs so as to get a probability as output. Therefore, k-nearest neighbors (k-NN) and linear regression are two strong candidates, but I am more inclined to k-NN because Age Group and death rate are not so linearly correlated.
Moreover, I want to try fitting the data into Random Forests model to see if I get better performance.
df
Leading Cancer Sites | Race | Year | Sex Code | Age Group | Count | Deaths | |
---|---|---|---|---|---|---|---|
0 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 0 | 8.0 | 0.0 |
1 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 1 | 8.0 | 8.0 |
2 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 5 | 8.0 | 8.0 |
3 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 10 | 8.0 | 0.0 |
4 | Brain and Other Nervous System | American Indian or Alaska Native | 1999 | F | 15 | 8.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
83595 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 65 | 177.0 | NaN |
83596 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 70 | 222.0 | NaN |
83597 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 75 | 167.0 | NaN |
83598 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 80 | 111.0 | NaN |
83599 | Urinary Bladder | Other Races and Unknown combined | 2018 | M | 85 | 89.0 | NaN |
83600 rows × 7 columns
Firstly, let's implement k-nearest neighbors model. Note that I added 0.01 in order to avoid infinity results.
#importing necessary tools from scikitlearn
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# Define the features.
df["Death Rate"] = df["Deaths"]/(df["Count"]+0.01)
features = ["Leading Cancer Sites", "Race", "Sex Code",
"Age Group"]
#Drop NaN
df.dropna(inplace=True)
# Define the training data.
# Represent the features as a list of dicts.
X_train_dict = df[features].to_dict(orient="records")
y_train = df["Death Rate"]
# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
#Performance test using cross validation
def KNN_MAE(k):
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=k)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
scores = cross_val_score(pipeline, X_train_dict, y_train,
cv=5, scoring="neg_mean_absolute_error")
MAE = np.mean(-scores)
return MAE
Let's visualize the MAE of kNN model versus mean absolute error (MAE). Based on that we can pick the optimal k.
MAE = []
for i in np.arange(1,50):
MAE.append(KNN_MAE(i))
pd.DataFrame(MAE).plot.line(xlabel = "k", ylabel="error", legend=False)
<AxesSubplot:xlabel='k', ylabel='error'>
The greater k, the smaller error. In average, we can build a decent kNN model with the MAE around 16. Now, we fit the same dataset to random forests model and evaluate the model based on number of estimators.
from sklearn.ensemble import RandomForestRegressor
def random_forest(k):
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = RandomForestRegressor(n_estimators=k, max_features="auto", random_state=44)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
scores = cross_val_score(pipeline, X_train_dict, y_train,
cv=5, scoring="neg_mean_absolute_error")
MAE = np.mean(-scores)
return MAE
MAE2 = []
for i in np.arange(1,50):
MAE2.append(random_forest(i))
pd.DataFrame(MAE2).plot.line(xlabel = "n_estimators", ylabel="error", legend=False)
<AxesSubplot:xlabel='n_estimators', ylabel='error'>
Remarkably, the error of this model is quite higher than k-nearest neighbors. However, the apparent tradeoff here is kNN algorithm runs much slower than random forests, so in further analysis, random forests may be more favorable. Let's predict some cases! What is the probability of a white female, who is 32, having liver cancer being mortal?
X_new_dict = [{
"Leading Cancer Sites": "Liver",
"Race": "White",
"Sex Code": "F",
"Age Group": 32
}]
X_new = vec.transform(X_new_dict)
X_new_sc = scaler.transform(X_new)
kNN_model = KNeighborsRegressor(n_neighbors=50)
kNN_model.fit(X_train_sc, y_train)
kNN_model.predict(X_new_sc)
array([0.57269629])
RF_model = RandomForestRegressor(n_estimators=9, max_features="auto", random_state=44)
RF_model.fit(X_train_sc, y_train)
RF_model.predict(X_new_sc)
array([0.66483548])
About 60%! How about a 45-year-old female who has prostate cancer?
X_new_dict = [{
"Leading Cancer Sites": "Prostate",
"Race": "Black or African American",
"Sex Code": "F",
"Age Group": 45
}]
X_new = vec.transform(X_new_dict)
X_new_sc = scaler.transform(X_new)
kNN_model = KNeighborsRegressor(n_neighbors=50)
kNN_model.fit(X_train_sc, y_train)
kNN_model.predict(X_new_sc)
array([0.028965])
RF_model = RandomForestRegressor(n_estimators=9, max_features="auto", random_state=44)
RF_model.fit(X_train_sc, y_train)
RF_model.predict(X_new_sc)
array([0.02619615])
It's about 3%. It doesn't make sense but our models did a good job!
The tutorial is a total walkthrough to explore cancer datasets and also gives readers insights into the stories they bring. Age, sex, cancer sites, and race are strongly correllated with the death rate of a cancer case, which reflects some facts about the mortality of this dangerous disease. k-nearest neighbors and random forests models are successfully employed to predict the death rate of a particular case. These model can be further optimized and it may help patiently seek approriate medication promptly.