Today's Course
Course Materials
- Data Cleaning with Python Jupyter notebook
- audible_uncleaned.csv dataset
- missing.csv toy 'dataset'
- Data Cleaning slide deck
- PDF copy of the Jupyter notebook
Learning Objectives
What will we likely need to know how to do in order to produce a clean dataset?
- fill in missing values or remove rows with missing values
- break up columns whose cells contain more than one chunk of data into multiple columns
- remove unnecessary whitespace from cells
- standardize data (fix typos & inconsistencies; format dates; standardize data types)
- merge duplicate rows or drop duplicates
- remove unnecessary data (drop extraneous variables or observations)
- check for data discrepancies
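As a rough illustration of the steps listed above, here is a minimal pandas sketch. The toy data and every column name in it (`title_author`, `released`, `price`, `notes`) are made up for this example and are not taken from the course files, so treat it as one possible approach rather than the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data; the columns are illustrative only.
raw = pd.DataFrame({
    "title_author": [
        "Dune | Frank Herbert",
        "  Emma | Jane Austen ",
        "Dune | Frank Herbert",
        "Ulysses | James Joyce",
    ],
    "released": ["1965-08-01", "1815-12-23", "1965-08-01", None],
    "price": ["9.99", "free", "9.99", "12.50"],
    "notes": ["", "", "", ""],
})

df = raw.copy()

# Break up a column holding two chunks of data into two columns.
df[["title", "author"]] = df["title_author"].str.split("|", expand=True)

# Remove unnecessary whitespace from the new columns.
for col in ["title", "author"]:
    df[col] = df[col].str.strip()

# Standardize data: fix an inconsistent value, then convert data types.
df["price"] = pd.to_numeric(df["price"].replace("free", "0"))
df["released"] = pd.to_datetime(df["released"])

# Remove unnecessary data (extraneous columns).
df = df.drop(columns=["title_author", "notes"])

# Drop duplicate rows.
df = df.drop_duplicates()

# Remove rows with a missing release date.
df = df.dropna(subset=["released"]).reset_index(drop=True)

# Check for a simple data discrepancy: prices should never be negative.
assert (df["price"] >= 0).all()

print(df)
```

The order matters here: whitespace is stripped before duplicates are dropped, so rows that differ only by stray spaces are still recognized as duplicates.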
Outside of this tutorial, the exact details of how to do each of these steps will differ, and two topics are mentioned only briefly: filling in missing values through value imputation, and finding data outliers through statistical methods.
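For a sense of what those two briefly mentioned topics can look like, here is a small sketch using made-up numbers. It assumes linear interpolation as the imputation method and the common 1.5 × IQR rule for flagging outliers. Both are illustrative choices, not necessarily the ones used in the notebook.

```python
import pandas as pd

# Hypothetical measurements with one missing value and one suspicious value.
values = pd.Series([10.0, None, 14.0, 16.0, 15.0, 200.0])

# Value imputation: fill the gap by linear interpolation between neighbors.
filled = values.interpolate()

# Outlier detection: flag points more than 1.5 * IQR outside the quartiles.
q1 = filled.quantile(0.25)
q3 = filled.quantile(0.75)
iqr = q3 - q1
is_outlier = (filled < q1 - 1.5 * iqr) | (filled > q3 + 1.5 * iqr)

print(filled)
print(filled[is_outlier])
```

Whether interpolation is appropriate depends on the data. It assumes neighboring rows are related, as in a time series, which is not true of every dataset.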
Data Cleaning with Python Video
The course will take approximately 2 hours to complete. You’re encouraged to take breaks as needed! Follow along in your Jupyter notebook with our Data Cleaning with Python and Pandas video:
Reminders
To close out of Jupyter, select ‘File’ -> ‘Shutdown’ and then close the browser window.
Additional Resources
Below is a list of resources that are mentioned in the Jupyter notebook:
- Pandas user guide: Interpolation methods using pandas and the scipy library
- Web page containing some general info on when to use different imputation techniques
- NumPy's datetime64 data type
- Stack Overflow thread on detecting and excluding outliers in a pandas DataFrame
- Some more info on how to use stats to identify outliers
- W3Schools: Python
Finally, I adapted information found in the following pages to help create this tutorial: