Full name: Thien Tran
Link to GitHub page of the project

United States Cancer Statistics 1999-2018


Final Data Science Tutorial by Thien Tran

1. Background

This project is a step-by-step walkthrough that explores US cancer statistics from 1999 to 2018. The dataset was provided by the Centers for Disease Control and Prevention (CDC) and the National Cancer Institute (NCI) as part of the Wide-ranging ONline Data for Epidemiologic Research (WONDER) project. The dataset is publicly available at United States Cancer Statistics.

Cancer incidence and mortality data are available for the United States as a whole, for states, and for metropolitan areas (MSA), broken down by age group, race, sex, year of diagnosis, and leading cancer site for the years 1999-2018.

2. Project goal

The goal of this project is to find out whether there is any relationship among the variables listed above. Some visualization tools will also be used to give insight into the statistical data and to explore some (potentially) interesting characteristics. Ultimately, I aspire to answer the question: can we predict how likely a cancer case is to be fatal?

3. Plan

Tools

Some basic libraries and Python built-in methods will be used to process the data. Matplotlib is expected to be the primary visualization tool. This tutorial is made by one person, so no collaboration plan is needed.

4. Data preview

Extraction

The dataset was downloaded from the CDC website, where the data documentation can also be found. There are 4 separate txt files: 2 for cancer incidence (1999-2008 and 2009-2018) and 2 for cancer mortality (1999-2008 and 2009-2018). The dataset has not undergone any data cleaning.

Load

Before loading, the necessary libraries need to be imported.

Now, let's load the 4 txt files using pd.read_csv and have a look at them (note that in the original files, the fields are separated by tabs):
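As a rough illustration of the loading step, here is a minimal sketch that reads a tab-separated export with pd.read_csv. The two-row sample, its column names, and the variable name incidence_sample are made up for this example; the real files are the four txt downloads described above.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt standing in for one of the tab-separated txt files.
sample = (
    "Notes\tYear\tYear Code\tSex\tCount\n"
    "\t1999\t1999\tFemale\t120\n"
    "\t1999\t1999\tMale\tSuppressed\n"
)

# sep="\t" tells pandas that the fields are separated by tabs, not commas.
incidence_sample = pd.read_csv(io.StringIO(sample), sep="\t")
print(incidence_sample.head())
```

With a real file, the io.StringIO(...) argument would simply be replaced by the path to the txt file.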

Transform and clean

Apparently, the datasets we just imported are not clean. First, there are some redundant columns (Year code, Age group, Sex, Notes) and some not-so-informative ones (Leading Cancer Sites Code, Race Code, Notes). Second, it is much easier to work with one table than with four mostly identical tables.

Hence, as the first step, let's concatenate the 2 incidence datasets and the 2 mortality datasets using pd.concat(); passing ignore_index=True makes the resulting index continuous:
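The effect of ignore_index=True can be seen on two tiny placeholder frames. The frame names and values below are invented for the sketch and only stand in for the two incidence files.

```python
import pandas as pd

# Placeholder frames standing in for the 1999-2008 and 2009-2018 incidence files.
inc_1999_2008 = pd.DataFrame({"Year": [1999, 2008], "Count": [10, 20]})
inc_2009_2018 = pd.DataFrame({"Year": [2009, 2018], "Count": [30, 40]})

# Without ignore_index=True the result would keep each frame's own index (0, 1, 0, 1).
incidence = pd.concat([inc_1999_2008, inc_2009_2018], ignore_index=True)
print(incidence.index.tolist())
```

The same call would be repeated for the two mortality files.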

Great! Now let's look at the data types of each table and then we can decide what to do next.

The Notes columns look strange, so let's have a look at them.
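One way to inspect such a column is to print the dtypes and count its missing values. This is only a sketch on a made-up frame; the column names mirror those mentioned in the text, and the values are placeholders.

```python
import numpy as np
import pandas as pd

# Placeholder frame with an entirely missing Notes column.
incidence = pd.DataFrame(
    {"Notes": [np.nan, np.nan], "Year": [1999, 2000], "Count": [10, 20]}
)

print(incidence.dtypes)

# Count missing entries and check whether the column carries any information at all.
missing_notes = int(incidence["Notes"].isna().sum())
notes_is_empty = bool(incidence["Notes"].isna().all())
print(missing_notes, notes_is_empty)

# An empty column can be dropped without losing data.
without_notes = incidence.drop(columns=["Notes"])
```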

They are basically empty. So we can remove that column later. Leading Cancer Sites Code looks strange at the last indices, let's have a closer look at them:

The last type has a different name, so we just rename it, and then we can join the two tables together.

Let's tidy up the two tables a bit and join them!
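The join could look roughly like the sketch below, which merges an incidence table and a mortality table on their shared descriptive columns. The key column names, the Count and Deaths column names, and all values are assumptions made for the example.

```python
import pandas as pd

# Assumed shared key columns; the real tables may use more or different keys.
keys = ["Year", "Race", "Sex", "Leading Cancer Sites"]

incidence = pd.DataFrame({
    "Year": [1999, 1999],
    "Race": ["White", "White"],
    "Sex": ["Female", "Male"],
    "Leading Cancer Sites": ["Pancreas", "Pancreas"],
    "Count": [100, 120],   # placeholder values
})
mortality = pd.DataFrame({
    "Year": [1999, 1999],
    "Race": ["White", "White"],
    "Sex": ["Female", "Male"],
    "Leading Cancer Sites": ["Pancreas", "Pancreas"],
    "Deaths": [90, 110],   # placeholder values
})

# An inner merge keeps only the groups present in both tables.
cancer = incidence.merge(mortality, on=keys, how="inner")
print(cancer)
```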

"Suppressed" values mean the number of cases is less than 16 but greater than 1. They are labelled "Suppressed" to protect the privacy of minorities. As an approximation, let's assume Suppressed = 8.
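A minimal sketch of this substitution: swap the "Suppressed" label for 8 and convert the column to numbers. The series below is a placeholder; only the label and the value 8 come from the text.

```python
import pandas as pd

# Placeholder Count column as it might look right after loading (strings).
counts = pd.Series(["120", "Suppressed", "45"], name="Count")

# Replace the label with the assumed midpoint, then convert everything to numbers.
cleaned = pd.to_numeric(counts.replace("Suppressed", "8"))
print(cleaned.tolist())
```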

Looks good! That's just the basic cleaning. Now let's get some insights from our data. What type of cancer is the most common?

What type of cancer causes the most deaths?

Male or female, who is more likely to contract pancreatic cancer?

Let's visualize it! First are the incidences and deaths sorted by type of cancer.
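Questions like these three can be answered with groupby. The sketch below uses a tiny made-up table, so the numbers and the resulting answers are placeholders, not findings from the real dataset; the column names are assumed.

```python
import pandas as pd

# Made-up rows; the real table has many more sites, years, and groups.
cancer = pd.DataFrame({
    "Leading Cancer Sites": ["Lung and Bronchus", "Lung and Bronchus", "Pancreas", "Pancreas"],
    "Sex": ["Female", "Male", "Female", "Male"],
    "Count": [50, 60, 10, 12],
    "Deaths": [40, 50, 9, 11],
})

# Total cases and deaths per cancer site.
by_site = cancer.groupby("Leading Cancer Sites")[["Count", "Deaths"]].sum()
most_common_site = by_site["Count"].idxmax()
deadliest_site = by_site["Deaths"].idxmax()

# Cases of one site split by sex.
pancreas_by_sex = (
    cancer[cancer["Leading Cancer Sites"] == "Pancreas"].groupby("Sex")["Count"].sum()
)
print(most_common_site, deadliest_site)
print(pancreas_by_sex)
```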
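A horizontal bar chart is one simple way to draw this comparison with Matplotlib. The per-site totals below are placeholders and the axis labels are my own choice, so treat this as a sketch of the plotting pattern rather than the actual figure.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder totals per cancer site (not real figures).
by_site = pd.DataFrame(
    {"Count": [110, 22], "Deaths": [90, 20]},
    index=["Lung and Bronchus", "Pancreas"],
)

fig, ax = plt.subplots(figsize=(8, 3))
# Sort so the most frequent site ends up at the top of the chart.
by_site.sort_values("Count").plot.barh(ax=ax)
ax.set_xlabel("Number of cases")
ax.set_ylabel("Cancer site")
ax.set_title("Incidences and deaths by cancer site")
fig.tight_layout()
```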

Look how dangerous Lung and Bronchus cancer is!

What is the trend in cancer deaths in the white community over the 20 years?

It seems that around 2015 new therapies were introduced, which may have resulted in the decrease in deaths. More information will be revealed in the future! Stay tuned!

To gain deeper insight into our dataset, we may have to introduce a new, independent dataset. Now we load the "Total income by race" dataset from the United States Census Bureau. The original data covers American household income from 1967 to 2021, but we only need the data from 1999 to 2018. Let's import and clean the data.

We are missing "American Indian or Alaska Native" in our income dataset, so we have to "make up" some data for this group based on facts I gathered. According to the American Community Survey, Native Americans' median income in 2005-2009 and 2015-2019 was $43,622 and $43,825, respectively.
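One possible way to turn these two figures into a yearly series is linear interpolation. This is an assumption for illustration and not necessarily how the data in this tutorial was built: the sketch places each figure at the midpoint of its survey period (2007 and 2017, my choice) and holds the values flat outside that range.

```python
import numpy as np

years = np.arange(1999, 2019)      # 1999..2018 inclusive

# Assumption: treat each ACS figure as the value at the midpoint of its period.
anchor_years = [2007, 2017]
anchor_income = [43622, 43825]

# np.interp interpolates linearly between anchors and clamps outside them.
native_income = np.interp(years, anchor_years, anchor_income)
print(dict(zip(years.tolist(), native_income.round(1).tolist())))
```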

Systematic statistics

Let's have a look at some basic summaries of our datasets!

One conclusion: White people have more cancer cases, but Native and African Americans have higher death rates. Overall, though, the death rate decreases over time for all races.

Let's look at their income statistics.

The death rate follows an ascending trend: Asian - White - Black - Native. Income follows the same order, Asian - White - Black - Native, but descending. Is it a coincidence? The higher the income, the lower the cancer death rate. One possible explanation is that cancer therapies cost a lot of money, so the more income people have, the less likely their cancer cases are to be fatal. It's time to look at something more specific. This time we focus only on the "White American" data.

In the white community, from 25 years old and above, the likelihood that a cancer case is fatal increases monotonically. Moreover, the mortality of children who have cancer is fairly high (20%).

Let's do the same thing with the black community. As shown above, the black community's data are pretty much the same as the white community's, except for the apparently higher child mortality.

Esophagus and pancreas are the deadliest cancer sites!

Let's come back to the question we would like to answer: can we predict how likely a cancer case is to be fatal? From our dataset, we know so far that cancer counts and deaths are linked to sex (breast and prostate are extreme cases), race, age group, and cancer site. We have 4 variables, and the label is the death rate. Since the labels range from 0 to 1, this is more like a regression problem: we enter the 4 variables as inputs and get a probability as output. Therefore, k-nearest neighbors (k-NN) and linear regression are two strong candidates, but I am more inclined toward k-NN because Age Group and death rate are not linearly correlated.

Moreover, I want to try fitting the data with a random forest model to see if I get better performance.

First, let's implement the k-nearest neighbors model. Note that I added 0.01 to avoid infinite results.
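Here is a minimal sketch of such a model with scikit-learn. Several things are assumptions on my part: the toy table and its values, the column names, one-hot encoding of the 4 categorical variables, and adding the 0.01 to the denominator of the death rate so that a zero count cannot produce an infinite value.

```python
import itertools

import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

features = ["Sex", "Race", "Age Group", "Leading Cancer Sites"]

# Toy table: every combination of a few category values, with placeholder counts.
rows = list(itertools.product(
    ["Female", "Male"],
    ["White", "Black or African American"],
    ["30-34 years", "60-64 years"],
    ["Pancreas", "Thyroid"],
))
cancer = pd.DataFrame(rows, columns=features)
cancer["Count"] = 100
cancer["Deaths"] = [90 if site == "Pancreas" else 5 for site in cancer["Leading Cancer Sites"]]

# Assumed placement of the 0.01: it keeps the ratio finite when Count is 0.
cancer["Death Rate"] = cancer["Deaths"] / (cancer["Count"] + 0.01)

# One-hot encode the categorical inputs so distances between rows are defined.
X = pd.get_dummies(cancer[features], dtype=float)
y = cancer["Death Rate"]

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
predictions = knn.predict(X)
```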

Let's visualize the mean absolute error (MAE) of the kNN model versus k. Based on that, we can pick the optimal k.
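The sweep over k can be sketched as follows. The features and labels are randomly generated placeholders, and the 75/25 split and the range of k are arbitrary choices for the example, so the resulting errors say nothing about the real data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Placeholder binary features (like one-hot columns) and rates in [0, 1].
X = rng.integers(0, 2, size=(200, 6)).astype(float)
y = np.clip(0.2 + 0.5 * X[:, 0] + rng.normal(0, 0.05, 200), 0, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit one model per k and record its error on the held-out rows.
mae_by_k = {}
for k in range(1, 21):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    mae_by_k[k] = mean_absolute_error(y_test, model.predict(X_test))

best_k = min(mae_by_k, key=mae_by_k.get)
print(best_k, mae_by_k[best_k])
```

Plotting list(mae_by_k) against list(mae_by_k.values()) gives the error curve.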

The greater k, the smaller the error. On average, we can build a decent kNN model with an MAE around 16. Now, we fit the same dataset to a random forest model and evaluate it based on the number of estimators.

Remarkably, the error of this model is somewhat higher than that of k-nearest neighbors. However, the apparent tradeoff is that the kNN algorithm runs much slower than random forests, so in further analysis, random forests may be preferable. Let's predict some cases! What is the probability that liver cancer is fatal for a 32-year-old white female?
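The same evaluation for a random forest might look like this sketch. Again the data is randomly generated, and the three values of n_estimators are arbitrary examples rather than the ones used in the tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder binary features and rates in [0, 1].
X = rng.integers(0, 2, size=(200, 6)).astype(float)
y = np.clip(0.2 + 0.5 * X[:, 0] + rng.normal(0, 0.05, 200), 0, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit one forest per ensemble size and record its held-out error.
mae_by_n = {}
for n in (10, 50, 100):
    forest = RandomForestRegressor(n_estimators=n, random_state=0)
    forest.fit(X_train, y_train)
    mae_by_n[n] = mean_absolute_error(y_test, forest.predict(X_test))

print(mae_by_n)
```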
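Predicting a single case requires encoding it with exactly the same columns as the training data. The sketch below shows one way to do that with reindex; the toy training table, its death rates, and the mapping of age 32 to an "30-34 years" age group label are assumptions for illustration.

```python
import itertools

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

features = ["Sex", "Race", "Age Group", "Leading Cancer Sites"]

# Toy training table with placeholder death rates.
rows = list(itertools.product(
    ["Female", "Male"],
    ["White", "Asian or Pacific Islander"],
    ["30-34 years", "45-49 years"],
    ["Liver", "Thyroid"],
))
train = pd.DataFrame(rows, columns=features)
train["Death Rate"] = [0.7 if site == "Liver" else 0.1 for site in train["Leading Cancer Sites"]]

X = pd.get_dummies(train[features], dtype=float)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, train["Death Rate"])

# The case to predict: age 32 is assumed to fall in the "30-34 years" group.
case = pd.DataFrame([{
    "Sex": "Female",
    "Race": "White",
    "Age Group": "30-34 years",
    "Leading Cancer Sites": "Liver",
}])

# Align the one-hot columns with the training columns; unseen columns become 0.
case_encoded = pd.get_dummies(case, dtype=float).reindex(columns=X.columns, fill_value=0.0)
predicted_rate = float(model.predict(case_encoded)[0])
print(predicted_rate)
```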

About 60%! How about a 45-year-old female who has prostate cancer?

It's about 3%. The case itself doesn't make sense, but our models did a good job!

5. Conclusion

This tutorial is a complete walkthrough that explores cancer datasets and gives readers insight into the stories they tell. Age, sex, cancer site, and race are strongly correlated with the death rate of a cancer case, which reflects some facts about the mortality of this dangerous disease. k-nearest neighbors and random forest models were successfully employed to predict the death rate of a particular case. These models can be further optimized and may help patients seek appropriate medication promptly.