Final Exam

Due Saturday 11:59 pm (Week 15)

You cannot use any of the datasets from our assignments, class notes, or your own midterm project. If you reuse one of them, you will receive a 0 for your final project.

1. Question Formulation (5 points): You need to devise a question that can be answered through data analysis. This question should be of your own creation, and it should reflect your curiosity and interest.

2. Data Collection (10 points): You are responsible for finding an appropriate dataset that aligns with your chosen question. Ensure that the data is clean and organized for analysis. If you don’t know where to find a dataset, you can use Kaggle.com. It can also give you inspiration for question formulation and data collection. You need to state where you got your data from in order to receive credit.
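
A minimal sketch of how the "clean and organized" check in step 2 might look, assuming pandas is available. The toy table and its column names are hypothetical stand-ins for a downloaded dataset, and dropping incomplete rows is only one possible cleaning choice, not a requirement of this handout.

```python
import pandas as pd

# Hypothetical toy data standing in for a dataset you downloaded.
df = pd.DataFrame({
    "age": [25, 32, None, 41, 32],
    "city": ["Austin", "Boston", "Austin", None, "Boston"],
    "income": [52000, 61000, 58000, 75000, 61000],
})

# Count missing values per column and fully duplicated rows.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())

# One simple cleaning choice (an assumption): drop duplicates, then incomplete rows.
clean = df.drop_duplicates().dropna()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("rows before:", len(df), "rows after:", len(clean))
```

Whatever cleaning you apply, describe it in the Data Collection and Preprocessing section of your report.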

3. Exploratory Data Analysis (30 points): Conduct an EDA to understand the characteristics of your dataset. This step will help you gain insights and identify patterns in the data. (Similar to Assignment 2.) Here are the key components of EDA I am expecting in your paper. Each component is worth 6 points if your EDA does not include any categorical variables, or 5 points if it includes the analysis of categorical variables.

1) Summary Statistics: Compute basic statistics for the dataset, such as mean, median, standard deviation, minimum, maximum, and quartiles. These provide an overview of the data’s central tendencies and spread.

2) Data Visualization: Create various plots and charts to visualize the data’s distribution and relationships. Common visualization tools include histograms, box plots, scatter plots, bar graphs, and line graphs.

3) Data Distribution: Examine the distribution of individual variables. This helps in identifying whether the data is normally distributed, skewed, or exhibits other patterns. Understanding the distribution can influence the choice of statistical tests and modeling techniques.

4) Correlation Analysis: Determine the relationships between variables using correlation coefficients or scatter plots. It can reveal potential associations and dependencies between variables.

5) Categorical Variables (only if your data includes this type of variable and you think it is important for answering your question; otherwise, don’t worry about it): Explore the distribution of categorical variables using frequency tables, bar charts, or pie charts.

6) Hypothesis Generation: Eventually, your exploratory data analysis can lead to the formulation of hypotheses about relationships or patterns in the data that answer your question or guide further analysis.
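
The numeric parts of the EDA components above can be sketched as follows, assuming pandas is available. The toy table and its column names (hours_studied, exam_score, section) are hypothetical; use your own dataset in your project.

```python
import pandas as pd

# Hypothetical toy data; replace with your own dataset.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score": [52, 55, 61, 64, 70, 74, 79, 85],
    "section": ["A", "B", "A", "B", "A", "B", "A", "A"],
})
numeric = df[["hours_studied", "exam_score"]]

# Summary statistics: count, mean, std, min, quartiles, max.
summary = numeric.describe()
medians = numeric.median()

# Data distribution: skewness near 0 suggests a roughly symmetric variable.
skewness = numeric.skew()

# Correlation analysis: Pearson correlation between numeric variables.
correlations = numeric.corr()

# Categorical variables: frequency table.
section_counts = df["section"].value_counts()

print(summary)
print(medians)
print(skewness)
print(correlations)
print(section_counts)
```

For the visualization component, pandas offers plotting helpers such as `df.hist()`, `df.boxplot()`, and `df.plot.scatter(x=..., y=...)`; these need matplotlib installed. A strong positive correlation like the one in this toy table is the kind of pattern that could be turned into a hypothesis in component 6.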

4. Machine Learning (30 points): Build at least 3 different predictive models. They can answer the same question, in which case you will need to compare their performance and pick the best one, or they can answer different questions. You have a lot of flexibility in this step, but all the models should help you answer the question that you formulated in step 1.
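
A minimal sketch of comparing three models on the same question, assuming scikit-learn is available. The synthetic data from `make_classification` is only a placeholder for your own dataset, and the three model types, the 5-fold cross-validation, and the accuracy metric are illustrative assumptions, not requirements of this handout.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data (an assumption); use your own features and target.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# Three different predictive models for the same classification question.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Mean 5-fold cross-validated accuracy for each model.
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

# Pick the model with the highest mean accuracy.
best_model = max(mean_accuracy, key=mean_accuracy.get)

for name, score in mean_accuracy.items():
    print(f"{name}: {score:.3f}")
print("best:", best_model)
```

If your question is a regression problem, the same pattern applies with regression models and a metric such as `scoring="r2"`.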

5. Project Structure (20 points): While this is a mini-project, your report should follow a structure similar to a combination of Assignment 2 and Assignment 3. This means it should include sections for Introduction, Data Collection and Preprocessing, EDA, Machine Learning, Results and Discussion, and Conclusion.

6. Data Attribution and References (5 points): In the conclusion section of your report, make sure to include a subsection titled “Data Attribution and References.” In this subsection, provide a detailed list of the sources where you obtained your data, including the dataset name, the organization or website from which it was sourced, and any relevant publication or citation information.

Additionally, if you consulted external research papers, articles, or resources during your project, please list these references in the same section.

General Requirements

1) You will need to write up your questions, findings, interpretations, and results for this assignment. It is a good idea to include screenshots of your code, results, and graphs so that you can explain your findings alongside them. (It is also easier for me to follow you when I read your paper.) A PDF file is required. There is no page limit, but try to be straightforward with your answers.

2) You also need to submit the .py file that you used to complete your assignment. (It may duplicate, or partly duplicate, the screenshots that you inserted in your paper, but that is okay. I would like to look over your code.)