Documentation for your capstone project
A prerequisite for this homework is that you have created a repository for your capstone project. This was an assignment in Module 5. If you haven’t done so yet, please work through the steps outlined in Assignment 1 of Module 5 before you start with the steps outlined here.
Step 1: Create a new folder for processed data
Open the Content tab of the ds4owd-002 workspace on Posit Cloud: https://posit.cloud/spaces/663318/content/all?sort=name_asc
Open your capstone project by clicking on it. The project name starts with `project-USERNAME` where `USERNAME` is your GitHub username.
Navigate to the Files tab in the bottom right window of RStudio.
Click on the `data` folder in the bottom right window.
If you don’t have a data folder yet, create it first by clicking the New Folder button and naming it data.
Click on the New Folder button to create a new folder.
Enter the name processed in the field and click OK.
Click on the new `processed` folder in the bottom right window to open it.
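The same folder can also be created from the R console instead of the Files tab. The following is a minimal sketch, assuming the working directory is the root of your capstone project. The helper name `create_processed_dir` is hypothetical and not part of the course material.

```r
# Hypothetical helper: create data/processed below a project root.
create_processed_dir <- function(project_root = ".") {
  target <- file.path(project_root, "data", "processed")
  # recursive = TRUE also creates the parent data folder if it is missing;
  # showWarnings = FALSE keeps the call quiet when the folder already exists.
  dir.create(target, recursive = TRUE, showWarnings = FALSE)
  invisible(target)
}
```

Calling `create_processed_dir()` from the project root should leave you with the same `data/processed` folder as the clicks described above.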
Step 2: Create a README.md file
Make sure you are in the `data/processed` folder in the Files tab (bottom right window of RStudio).
Click on the Blank File button (with a plus icon) to create a new file.
Select the option Text File.
Enter the name README.md in the field and click OK.
Open this link in a new browser tab: https://github.com/ds4owd-002/metadata-readme-template
Click on the Raw button in the top right corner of the file view to see the raw markdown content.
Select all the content (Ctrl+A on Windows or Cmd+A on Mac) and copy it (Ctrl+C or Cmd+C).
Return to your Posit Cloud project and paste the content into the empty `README.md` file you created in the `data/processed` folder.
Save the file by clicking File > Save or using Ctrl+S (Cmd+S on Mac).
Commit this change:
- Navigate to the Git pane in the top-right window of RStudio
- Check the box next to `README.md` to stage it
- Click Commit
- Enter commit message: `add README template to processed data folder`
- Click Commit and then Close
Step 3: Create a data dictionary
Open a spreadsheet tool of your choice (Microsoft Excel, Google Sheets, or LibreOffice Calc).
Create a new spreadsheet and add two column names in the first row:
- `variable_name`
- `description`
You do not need to describe all variables yet. Start with at least 3-5 key variables from your dataset.
Save the file as dictionary.xlsx on your computer.
Also save the file as a CSV by selecting File > Save As or Export, then choose CSV format and name it dictionary.csv.
When you need to edit your data dictionary in the future:
1. Edit the .xlsx file in your spreadsheet software
2. Export/save it as .csv again
3. Re-upload the updated .csv file to your Posit Cloud project folder
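If you prefer to draft the dictionary in R rather than in a spreadsheet, it can be built as a data frame and written to CSV. This is only a sketch under assumptions: the variable names and descriptions below are invented placeholders, and the helper `write_dictionary` is hypothetical. Only the two column names `variable_name` and `description` come from the steps above.

```r
# Placeholder entries: replace them with variables from your own dataset.
dictionary <- data.frame(
  variable_name = c("site_id", "sample_date", "ph"),
  description = c(
    "Unique identifier of the sampling site",
    "Date the sample was collected (YYYY-MM-DD)",
    "pH value measured in the field"
  )
)

# Hypothetical helper: write the dictionary as CSV without row names.
write_dictionary <- function(dictionary, path) {
  write.csv(dictionary, path, row.names = FALSE)
  invisible(path)
}
```

For example, `write_dictionary(dictionary, "dictionary.csv")` would produce a file with the same two-column layout as the spreadsheet export.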
Step 4: Upload the dictionary
Return to your capstone project on Posit Cloud: https://posit.cloud/spaces/663318/content/all?sort=name_asc
Navigate to the Files tab in the bottom right window.
Navigate to the `data/processed` folder by clicking on it.
Click the Upload button (with an up arrow icon).
Click Choose File, select your `dictionary.csv` file, and click OK.
Upload only the .csv version of your dictionary to your project. The .xlsx file should remain on your local computer for editing purposes.
- Commit this change:
  - Navigate to the Git pane in the top-right window of RStudio
  - Check the box next to `dictionary.csv` to stage it
  - Click Commit
  - Enter commit message: `add data dictionary for processed data`
  - Click Commit and then Close
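After the upload, it can be worth confirming that the CSV export kept the expected structure. The sketch below assumes the dictionary uses exactly the two column names from Step 3. The helper `check_dictionary_file` is hypothetical and uses only base R.

```r
# Hypothetical helper: check that a dictionary CSV has the expected shape.
check_dictionary_file <- function(path) {
  dictionary <- read.csv(path, stringsAsFactors = FALSE)
  expected <- c("variable_name", "description")
  missing_cols <- setdiff(expected, names(dictionary))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Flag rows whose description is empty or missing.
  empty <- is.na(dictionary$description) | trimws(dictionary$description) == ""
  if (any(empty)) {
    warning("Variables without a description: ",
            paste(dictionary$variable_name[empty], collapse = ", "))
  }
  invisible(dictionary)
}
```

In the project you could run `check_dictionary_file(here::here("data/processed/dictionary.csv"))`. An error points to a renamed or missing column, and a warning lists variables that still need a description.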
Step 5: Prepare your analysis-ready (processed) data
It’s important to start this step now, even if your data preparation is not perfect yet. This step will involve several iterations, depending on the complexity of your raw data. For this homework assignment, create a first version so that we can start evaluating the complexity of your project.
Open the `index.qmd` file in your capstone project on Posit Cloud.
Add a code chunk at the beginning of your document to load the necessary R packages:
```{r}
library(tidyverse)
library(here)
```
- In a new code chunk, import your raw data. In this example we are using a CSV file:
```{r}
raw_data <- read_csv(here::here("data/raw/your-file-name.csv"))
```
- Write code to bring your data into a state where it’s ready for analysis. For example:
  - Rename columns with `rename()`
  - Select columns that are relevant with `select()`
  - Remove missing values with `filter()` or `drop_na()`
  - Join several dataframes with `left_join()`, `right_join()`, etc.
  - Create new variables with `mutate()`
- Once you have your data in a state where it’s ready for analysis, save it as a CSV file in the `data/processed` folder:
```{r}
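# column1, column2, column3 and new_name are placeholders for your own names
processed_data <- raw_data |>
  select(column1, column2, column3) |>  # keep only the relevant columns
  filter(!is.na(column1)) |>            # drop rows where column1 is missing
  rename(new_name = column2)            # new = old; old must be a selected column

write_csv(processed_data,
          here::here("data/processed/my-processed-data.csv"))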
```
Replace `my-processed-data.csv` with a descriptive name for your processed dataset. This is the file that your data dictionary should describe.
- Commit these changes:
  - Navigate to the Git pane in the top-right window of RStudio
  - Check the boxes next to `index.qmd` and your processed data file to stage them
  - Click Commit
  - Enter commit message: `create processed data and update analysis document`
  - Click Commit and then Close
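To make the preparation verbs listed in this step more concrete, here is a small worked sketch on invented toy data. Every column name, value, and the `site_info` lookup table are placeholders rather than part of the assignment. It assumes the `dplyr` and `tidyr` packages, which are loaded with the tidyverse. The last lines compare the processed columns against a toy dictionary, since the dictionary should describe the processed file.

```r
library(dplyr)
library(tidyr)

# Invented toy data standing in for raw_data; all names are placeholders.
raw_data <- data.frame(
  ID = c(1, 2, 3, 4),
  Site = c("A", "B", "B", "C"),
  ph_value = c(7.1, NA, 6.8, 7.4),
  notes = c("ok", "ok", "check", "ok")
)
site_info <- data.frame(
  site = c("A", "B", "C"),
  region = c("north", "south", "south")
)

processed_data <- raw_data |>
  rename(id = ID, site = Site, ph = ph_value) |>  # consistent lower-case names
  select(id, site, ph) |>                         # drop the free-text notes column
  drop_na(ph) |>                                  # remove rows without a pH value
  left_join(site_info, by = "site") |>            # add the region of each site
  mutate(ph_neutral = ph >= 6.5 & ph <= 7.5)      # derive a new logical variable

# Toy dictionary that does not yet cover the two newly added variables.
dictionary <- data.frame(
  variable_name = c("id", "site", "ph"),
  description = c("Sample identifier", "Site code", "Measured pH")
)

# Variables in the processed data that the dictionary does not describe yet.
undocumented <- setdiff(names(processed_data), dictionary$variable_name)
```

With this toy data, `processed_data` keeps three of the four rows, and `undocumented` contains `region` and `ph_neutral`, which are the two variables that would still need an entry in the dictionary.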
Step 6: Push your changes to GitHub
You have made several commits in the previous steps. Now it’s time to push all of these commits to GitHub.
- Navigate to the Git pane in the top-right window of RStudio.
- Click on the Push button (with an up arrow icon).
- Enter your GitHub username and click OK.
- Enter your GitHub Personal Access Token (PAT) in the field and click OK. This is the personal access token you created in Module 1.
If you see a message that says HEAD -> main, then you have successfully pushed your changes to the remote repository on GitHub. Click Close. If you do not see this message, make sure you have entered your GitHub username and GitHub PAT correctly.
Step 7: Open an issue on GitHub
- Open https://github.com/ in your browser.
- Navigate to the GitHub organization for the course: https://github.com/ds4owd-002
- Find the repository `project-USERNAME` that ends with your GitHub username, and open it by clicking on the repository name.
  - Replace `USERNAME` with your actual GitHub username.
You can search for your repository by typing your username in the search bar just below the Repository heading.
- You can verify here that your changes have been pushed to the remote repository by looking at the commit messages and files in your project.
- Click on the Issues tab.
- Click on the green New issue button.
- In the Title field write: Prepared first iteration of analysis-ready (processed) data.
- In the Leave a comment field, tag the course instructors `@seawaR` `@larnsce` and briefly describe what you’ve added (README, dictionary, processed data).
- Scroll down the page and click the green Submit new issue button.
Congratulations! You have completed the project metadata documentation assignment.