Documentation for your capstone project

Author: Lars Schöbitz

Affiliation: ETH Zurich

Have you created a repository for your capstone project?

A prerequisite for this homework is that you have created a repository for your capstone project. This was an assignment in Module 5. If you haven’t done so yet, please work through the steps outlined in Assignment 1 of Module 5 before you start with the steps outlined here.

Step 1: Create a new folder for processed data

  1. Open the Content tab of the ds4owd-002 workspace on Posit Cloud: https://posit.cloud/spaces/663318/content/all?sort=name_asc

  2. Open your capstone project by clicking on it. The project name starts with project-USERNAME where USERNAME is your GitHub username.

  3. Navigate to the Files tab in the bottom right window of RStudio.

  4. Click on the data folder in the bottom right window.

Tip

If you don’t have a data folder yet, create it first by clicking the New Folder button and naming it data.

  5. Click on the New Folder button to create a new folder.

  6. Enter the name processed in the field and click OK.

  7. Click on the new processed folder in the bottom right window to open it.
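
The same folder can also be created from the R Console instead of the Files tab. The following is a minimal sketch and not part of the assignment steps: the helper name `create_processed_dir` and its `project_root` argument are hypothetical, and the default assumes that the working directory is the root of your capstone project.

```r
# Hypothetical helper: create data/processed inside a project folder.
# project_root is an assumption; "." means the current working directory.
create_processed_dir <- function(project_root = ".") {
  processed_dir <- file.path(project_root, "data", "processed")
  # recursive = TRUE also creates the data folder if it does not exist yet;
  # showWarnings = FALSE keeps the call quiet if the folder already exists.
  dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)
  # Return TRUE (invisibly) if the folder exists afterwards
  invisible(dir.exists(processed_dir))
}
```

Running `create_processed_dir()` in the Console should then make the data/processed folder appear in the Files tab. Calling it a second time does no harm, because an existing folder is left untouched.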

Step 2: Create a README.md file

  1. Make sure you are in the data/processed folder in the Files tab (bottom right window of RStudio).

  2. Click on the Blank File button (with a plus icon) to create a new file.

  3. Select the option Text File.

  4. Enter the name README.md in the field and click OK.

  5. Open this link in a new browser tab: https://github.com/ds4owd-002/metadata-readme-template

  6. Click on the Raw button in the top right corner of the file view to see the raw markdown content.

  7. Select all the content (Ctrl+A on Windows or Cmd+A on Mac) and copy it (Ctrl+C or Cmd+C).

  8. Return to your Posit Cloud project and paste the content into the empty README.md file you created in the data/processed folder.

  9. Save the file by clicking File > Save or using Ctrl+S (Cmd+S on Mac).

  10. Commit this change:

    • Navigate to the Git pane in the top-right window of RStudio
    • Check the box next to README.md to stage it
    • Click Commit
    • Enter commit message: add README template to processed data folder
    • Click Commit and then Close

Step 3: Create a data dictionary

  1. Open a spreadsheet tool of your choice (Microsoft Excel, Google Sheets, or LibreOffice Calc).

  2. Create a new spreadsheet and add two column names in the first row:

    • variable_name
    • description
  3. Add one row for each variable you describe. You do not need to describe all variables yet; start with 3 to 5 key variables from your dataset.

  4. Save the file as dictionary.xlsx on your computer.

  5. Also save the file as a CSV: select File > Save As or Export, choose the CSV format, and name the file dictionary.csv.

Working with your data dictionary

When you need to edit your data dictionary in the future:

  1. Edit the .xlsx file in your spreadsheet software.
  2. Export or save it as .csv again.
  3. Re-upload the updated .csv file to your Posit Cloud project folder.
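
If you would rather start the dictionary from your data than from an empty spreadsheet, the column names can be written out from R. This is a minimal sketch under stated assumptions: the helper name `make_dictionary_template` is hypothetical, the small example data frame with made-up values stands in for your own dataset, and base R is used so that the example runs without extra packages.

```r
# Hypothetical helper: build an empty data dictionary from a data frame.
make_dictionary_template <- function(data) {
  data.frame(
    variable_name = names(data),          # one row per column of the dataset
    description   = rep("", ncol(data)),  # left empty, to be filled in by hand
    stringsAsFactors = FALSE
  )
}

# Example data standing in for your own dataset (made-up values)
example_data <- data.frame(
  site_id = c("A", "B"),
  ph      = c(7.1, 6.8)
)

dictionary <- make_dictionary_template(example_data)

# Write the template to a temporary file for this demonstration;
# in your own project you would choose your own path instead.
dictionary_path <- file.path(tempdir(), "dictionary.csv")
write.csv(dictionary, dictionary_path, row.names = FALSE)
```

The resulting file has the two columns variable_name and description and can be opened in your spreadsheet tool to fill in the descriptions.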

Step 4: Upload the dictionary

  1. Return to your capstone project on Posit Cloud: https://posit.cloud/spaces/663318/content/all?sort=name_asc

  2. Navigate to the Files tab in the bottom right window.

  3. Navigate to the data/processed folder by clicking on it.

  4. Click the Upload button (with an up arrow icon).

  5. Click Choose File, select your dictionary.csv file, and click OK.

Warning

Upload only the .csv version of your dictionary to your project. The .xlsx file should remain on your local computer for editing purposes.

  6. Commit this change:
    • Navigate to the Git pane in the top-right window of RStudio
    • Check the box next to dictionary.csv to stage it
    • Click Commit
    • Enter commit message: add data dictionary for processed data
    • Click Commit and then Close

Step 5: Prepare your analysis-ready (processed) data

Start now, iterate later

It’s important to start this step now, even if your data preparation is not perfect yet. This step will involve several iterations, depending on the complexity of your raw data. For this homework assignment, create a first version so that we can start evaluating the complexity of your project.

  1. Open the index.qmd file in your capstone project on Posit Cloud.

  2. Add a code chunk at the beginning of your document to load the necessary R packages:

```{r}
library(tidyverse)
library(here)
```
  3. In a new code chunk, import your raw data. In this example we are using a CSV file:
```{r}
raw_data <- read_csv(here::here("data/raw/your-file-name.csv"))
```
  4. Write code to bring your data into a state where it’s ready for analysis. For example:
    • Rename columns with rename()
    • Select columns that are relevant with select()
    • Remove missing values with filter() or drop_na()
    • Join several dataframes with left_join(), right_join(), etc.
    • Create new variables with mutate()
  5. Once you have your data in a state where it’s ready for analysis, save it as a CSV file in the data/processed folder:
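```{r}
# column1, column2, column3 and new_name are placeholders:
# replace them with names from your own data
processed_data <- raw_data |>
  # keep only the columns that are relevant for the analysis
  select(column1, column2, column3) |>
  # remove rows where column1 is missing
  filter(!is.na(column1)) |>
  # rename() uses new = old; the old column must still exist after select()
  rename(new_name = column3)

# save the analysis-ready data in the data/processed folder
write_csv(processed_data,
          here::here("data/processed/my-processed-data.csv"))
```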
Tip

Replace my-processed-data.csv with a descriptive name for your processed dataset. This is the file that your data dictionary should describe.
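
Because the data dictionary should describe the processed file, it can help to check that every column has an entry. The sketch below shows one possible check and is not part of the assignment: the helper name `undocumented_variables` is hypothetical, and the two small data frames with made-up values stand in for your processed data and your dictionary.csv.

```r
# Hypothetical helper: return the column names of the data that are
# missing from the variable_name column of the dictionary.
undocumented_variables <- function(data, dictionary) {
  setdiff(names(data), dictionary$variable_name)
}

# Made-up stand-ins for the processed data and the dictionary
example_processed <- data.frame(site_id = c("A", "B"), ph = c(7.1, 6.8))
example_dictionary <- data.frame(
  variable_name = c("site_id"),
  description   = c("Identifier of the sampling site")
)

# ph has no entry in the example dictionary, so it is reported here
missing_vars <- undocumented_variables(example_processed, example_dictionary)
print(missing_vars)
```

In your own project you would pass in the processed data and the dictionary read from data/processed instead of the stand-ins. An empty result means that every column of the processed data is listed in the dictionary.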

  6. Commit these changes:
    • Navigate to the Git pane in the top-right window of RStudio
    • Check the boxes next to index.qmd and your processed data file to stage them
    • Click Commit
    • Enter commit message: create processed data and update analysis document
    • Click Commit and then Close

Step 6: Push your changes to GitHub

You have made several commits in the previous steps. Now it’s time to push all of these commits to GitHub.

  1. Navigate to the Git pane in the top-right window of RStudio.
  2. Click on the Push button (with an up arrow icon).
  3. Enter your GitHub username and click OK.
  4. Enter your GitHub Personal Access Token (PAT) in the field and click OK. This is the personal access token you created in Module 1.

If you see a message that says HEAD -> main, then you have successfully pushed your changes to the remote repository on GitHub. Click Close. If you do not see this message, make sure you have entered your GitHub username and GitHub PAT correctly.

Step 7: Open an issue on GitHub

  1. Open https://github.com/ in your browser.
  2. Navigate to the GitHub organization for the course: https://github.com/ds4owd-002
  3. Find the repository project-USERNAME that ends with your GitHub username, and open it by clicking on the repository name.
    • Replace USERNAME with your actual GitHub username.
Tip

You can search for your repository by typing your username in the search bar just below the Repository heading.

  4. You can verify here that your changes have been pushed to the remote repository by looking at the commit messages and files in your project.
  5. Click on the Issues tab.
  6. Click on the green New issue button.
  7. In the Title field write: Prepared first iteration of analysis-ready (processed) data.
  8. In the Leave a comment field, tag the course instructors @seawaR @larnsce and briefly describe what you’ve added (README, dictionary, processed data).
  9. Scroll down the page and click the green Submit new issue button.

Congratulations! You have completed the project metadata documentation assignment.