Data science lifecycle & Exploratory data analysis using visualization

ds4owd - data science for openwashdata

Lars Schöbitz

ETH Zurich

Sep 18, 2025

Q: How do I successfully complete the course?

You successfully complete the course and receive a certificate of completion if you:

  1. Complete the quiz of each week by its due date

  2. Hand in a complete capstone project report that uses a dataset of your choice by 11 December 2025 (instructions will follow)

While homework assignments are not required for completion, we highly recommend working on them and submitting them for feedback. This is the practice you need to become a data scientist.

Solving coding problems online

Tips for search engines (e.g. duckduckgo.com)

  • Use actionable verbs that describe what you want to do
  • Be specific
  • Add R to the search query
  • Add the name of the R package to the search query
  • Scroll through the top 5 results (don’t just pick the first)

Example: “How to remove a legend from a plot in R ggplot2”
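
To illustrate what a useful answer to this example query might look like, here is a minimal sketch. It assumes the ggplot2 package is installed and uses the `mpg` dataset that ships with ggplot2; the variable choices are only for illustration.

```r
library(ggplot2)

# Mapping colour to a variable creates a legend automatically
p <- ggplot(data = mpg,
            mapping = aes(x = displ,
                          y = hwy,
                          color = class)) +
  geom_point()

# theme(legend.position = "none") removes the legend from the plot
p_no_legend <- p +
  theme(legend.position = "none")

p_no_legend
```

The search query names the action (remove), the object (legend), the language (R) and the package (ggplot2), which is why answers of this kind tend to show up near the top.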

Stack Overflow

What is it?

  • The biggest support network for (coding) problems
  • Can be intimidating at first
  • Up-vote system

Workflow

  • First, briefly read the question that was posted
  • Then, read the answer marked as “correct”
  • Then, read one or two more answers with high votes

Tips for AI tools (e.g. duck.ai)

  • Use actionable verbs that describe what you want to do
  • Be specific
  • Add R to the prompt
  • Add the name of the R package to the prompt

Example: “How to remove a legend from a plot in R ggplot2”

Other sources for help

Exercises & Assignments

Note

For every week, there will be two “projects”:

  • md-0X-exercises (used for the lectures; we only work with it during lectures)
  • md-0X-assignments-USERNAME (your homework; for every module, the name ends with your username. You will clone this to Posit Cloud to work with it)

Lecture Exercises

  • Click Start the first time
  • Click Continue the next time
  • It will look like two projects, but it is just one
  • It doesn’t matter on which of the two you click

Homework Assignments

on GitHub Organisation

Assignment 1 of Module 2 is a Bookmark Folder assignment!

on your repository

on Posit Cloud

Module 2 - Assignment 1: Bookmarks

This week's assignment is to create a bookmark folder in your browser. The bookmarks are very useful and will support you through the course and beyond.

Version Control - Terminology

remember: git commit

remember: git push

collaborate: git clone

track work: git commit

update: git ???

update: git push

git ???

new: git pull

Learning Objectives (for this week)

  1. Learners can list the six elements of the data science lifecycle.
  2. Learners can describe the four main aesthetic mappings that can be used to visualise data using the ggplot2 R Package.
  3. Learners can control the colour scaling applied to a plot using colour as an aesthetic mapping.
  4. Learners can compare three different geoms (bar/col, histogram, point) and their use case.

Data Science Lifecycle

Exploratory Data Analysis with ggplot2

R Package ggplot2

My turn: Working with R



Sit back and enjoy!

Take a break

Please get up and move!

10:00

Code structure

  • ggplot() is the main function in ggplot2
  • Plots are constructed in layers
  • Structure of the code for plots can be summarized as
ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], 
                     y = [y-variable])) +
  geom_xxx() +
  other options

Code structure

ggplot()

Code structure

ggplot(data = gapminder)

Code structure

ggplot(data = gapminder,
       mapping = aes()) 

Code structure

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = lifeExp))  

Code structure

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = lifeExp)) +
  geom_boxplot() 

Code structure

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = lifeExp)) +
  geom_boxplot() +
  theme_minimal()
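
The slides show the plotting code without the package loading step. The following sketch shows how the final version might run on its own, assuming that the `gapminder` data frame comes from the gapminder R package and that both packages are installed. The axis labels at the end are an illustrative addition.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

# Data and aesthetic mappings first, then one geom layer, then a theme
p <- ggplot(data = gapminder,
            mapping = aes(x = continent,
                          y = lifeExp)) +
  geom_boxplot() +
  theme_minimal()

p

# Because plots are built in layers, a stored plot can be extended later
p +
  labs(x = "Continent",
       y = "Life expectancy (years)")
```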

Polls

Poll 1: What does the thick line inside the box of a boxplot represent?

  1. the mean of the observations
  2. the middle of the box
  3. the median of the observations
  4. none of the above

Poll 2: What percentage of observations are contained inside the box of a boxplot (interquartile range)?

  1. 25%
  2. depends on the median
  3. 50%
  4. none of the above

Poll 3: What is the median of a set of observations?

  1. The median is the most frequently occurring value in a dataset.
  2. The median is the sum of all values in a dataset divided by the number of observations.
  3. The median is the point above and below which half (50%) of the observations falls.
  4. The median is the square root of the sum of the squares of each value in a dataset.

Boxplot, explained

Figure 1: Diagram depicting how a boxplot is created.
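
The numbers behind a boxplot can be computed directly in base R. The sketch below uses made-up values (not course data) to show the median, the mean, and the two quartiles that mark the lower and upper edges of the box. It uses the default method of `quantile()`.

```r
# Made-up values for illustration only; 95 is a deliberately extreme value
x <- c(12, 15, 11, 20, 18, 14, 95)

sort(x)

median(x)  # middle value of the sorted observations
mean(x)    # pulled upwards by the extreme value

# First and third quartile: lower and upper edge of the box
q <- quantile(x, probs = c(0.25, 0.75))
q

IQR(x)     # interquartile range: the height of the box
```

The median barely reacts to the extreme value, while the mean does, which is one reason boxplots are drawn around the median.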

Our turn: md-02-exercises

  1. Open posit.cloud/spaces/663318/content in your browser.
  2. Verify that the ds4owd workspace is open.
  3. Click Start next to md-02-exercises.
  4. In the File Manager in the bottom right window, locate the md-02b-visualize.qmd file and click on it to open it in the top left window.

30:00

Take a break

Please get up and move!

10:00

Visualizing data

Types of variables

numerical

discrete variables

  • non-negative
  • whole numbers
  • e.g. number of students, roll of a die

continuous variables

  • infinite number of values
  • also dates and times
  • e.g. length, weight, size

non-numerical

categorical variables

  • finite number of values
  • distinct groups (e.g. EU countries, continents)
  • ordinal if levels have natural ordering (e.g. weekdays, school grades)
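
As a rough sketch, the variable types above could be represented in R as follows. All object names and values are hypothetical and chosen only to illustrate the mapping from variable type to R data type.

```r
# Discrete numerical: whole-number counts, stored as integers (L suffix)
n_students <- c(21L, 18L, 25L)

# Continuous numerical: measurements, stored as doubles
weight_kg <- c(3.2, 4.75, 5.1)

# Categorical: distinct groups without an order, stored as a factor
continent <- factor(c("Asia", "Europe", "Asia"))

# Ordinal: categorical with a natural order, stored as an ordered factor
weekday <- factor(c("Mon", "Wed", "Tue"),
                  levels = c("Mon", "Tue", "Wed"),
                  ordered = TRUE)

class(n_students)
class(weight_kg)
levels(continent)

# Comparisons like this only work for ordered factors
weekday[1] < weekday[2]
```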

Histogram

  • for visualizing distribution of continuous (numerical) variables
ggplot(data = penguins,
       mapping = aes(x = body_mass_g)) +
  geom_histogram()
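
A self-contained variant of the histogram might look like the sketch below. It assumes that `penguins` comes from the palmerpenguins package, which has a `body_mass_g` column. The bin width of 250 g is an arbitrary choice for illustration; without it, `geom_histogram()` falls back to 30 bins and prints a message suggesting a better value.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: provides the penguins data frame

p <- ggplot(data = penguins,
            mapping = aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250)  # each bar covers 250 g of body mass

p
```

Trying a few different bin widths is usually worthwhile, because the apparent shape of the distribution depends on it.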

Barplot

  • for visualizing distribution of categorical (non-numerical) variables
ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar()
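
The learning objectives mention bar and col together. The sketch below contrasts the two, again assuming that `penguins` comes from the palmerpenguins package. `geom_bar()` counts the rows per category itself, whereas `geom_col()` expects the bar heights to be present in the data already. The helper data frame `species_counts` is a hypothetical name introduced for this example.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: provides the penguins data frame

# geom_bar(): ggplot2 counts the observations per species
p_bar <- ggplot(data = penguins,
                mapping = aes(x = species)) +
  geom_bar()

# Pre-computed counts, one row per species, with columns species and n
species_counts <- as.data.frame(table(species = penguins$species),
                                responseName = "n")

# geom_col(): bar heights are taken from the y variable
p_col <- ggplot(data = species_counts,
                mapping = aes(x = species,
                              y = n)) +
  geom_col()

p_bar
p_col
```

Both plots should look the same; which geom to use depends on whether the data are raw observations or already summarised.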

Scatterplot

  • for visualizing relationships between two continuous (numerical) variables
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap,
                     y = lifeExp,
                     size = pop,
                     color = continent)) +
  geom_point() +
  scale_color_colorblind() +
  theme_minimal()
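
The slide does not show where `gapminder_2007` comes from. The sketch below assumes it is the gapminder data restricted to the year 2007, and that `scale_color_colorblind()` is taken from the ggthemes package. The second plot swaps in `scale_color_brewer()` from ggplot2 as one alternative way to control the colour scale; the palette choice is illustrative.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame
library(ggthemes)   # provides scale_color_colorblind()

# Assumption: gapminder_2007 holds only the observations from 2007
gapminder_2007 <- subset(gapminder, year == 2007)

# Shared base plot: four aesthetic mappings (x, y, size, color)
base_plot <- ggplot(data = gapminder_2007,
                    mapping = aes(x = gdpPercap,
                                  y = lifeExp,
                                  size = pop,
                                  color = continent)) +
  geom_point() +
  theme_minimal()

# Colour scale used on the slide
p_colorblind <- base_plot +
  scale_color_colorblind()

# Alternative colour scale that ships with ggplot2
p_brewer <- base_plot +
  scale_color_brewer(palette = "Dark2")

p_colorblind
p_brewer
```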

Your turn: md-02-exercises

  1. If you closed it, open posit.cloud/spaces/663318/content in your browser.
  2. Verify that the ds4owd workspace is open.
  3. Click Continue next to md-02-exercises.
  4. In the File Manager in the bottom right window, locate the md-02c-boxplot.qmd file and click on it to open it in the top left window.
  5. Follow instructions in the file.

15:00

How to pick a plot?

Homework assignments module 2

Module 2 documentation

Homework due date

  • Homework assignment due: Wednesday, 2025-09-24
  • Quiz due: Wednesday, 2025-10-01

Wrap-up

Thanks!

Slides created via revealjs and Quarto: https://quarto.org/docs/presentations/revealjs/

Access slides as PDF on GitHub

All material is licensed under Creative Commons Attribution Share Alike 4.0 International.