Set up

Install R and RStudio

First things first. To complete the exercises, you have to install the R base system and RStudio Desktop. RStudio is an integrated development environment (IDE) for R, which gives you easier control over your R code. If you are in doubt about something, don’t hesitate to look at the RStudio Cheatsheet.

Tip: Cheatsheets are a very useful resource to help you out. Here you can find a wide variety of cheatsheets for all kinds of R applications.

Get the course materials

You can find all course materials in the maptimeR repository on GitHub. The easiest way to start with the exercises is to clone this repository (of course, you can also fork it first, such that you can push your results to your own copy of the repository afterwards). If you don’t have Git, or don’t want to use it, you can manually download the files exercises.Rmd and app.Rmd and store them in a new directory named maptimeR.

Create a project

Now you can open RStudio! The most convenient way to work on a specific task in RStudio is to create a project for it. Each project has its own working directory, workspace, history, and source documents. If you cloned the git repository, a project file is already in there, and you can simply open it by clicking File -> Open Project. If you only downloaded the exercises, click instead on File -> New Project, choose Existing Directory, and select the maptimeR directory in which you stored the exercises.

Open the exercises source file

In the bottom-right pane of your screen, under the header Files, you will see a list of all files that are part of your project’s directory. One of them is exercises.Rmd. The file extension .Rmd stands for R Markdown. Markdown itself is a simple formatting language in which you write your text in plain text rather than in complicated code like HTML or LaTeX. R Markdown enables you to embed ‘chunks’ of R code in such Markdown files, which makes it well suited for creating dynamic documents with R. To work with it, we do need to install the rmarkdown package. Do so by running install.packages("rmarkdown") in the console pane, which you see in the bottom-left pane of your screen.

Now you can open exercises.Rmd by clicking on it. If you have a strong aversion to clicking, you can also run file.edit("exercises.Rmd") in the console pane.

Get started

Structure

The exercises are divided into several chapters. Each chapter provides some explanation, accompanied by examples, and ends with a set of hands-on exercises. Some of these exercises can be answered with just text, but most of them require you to write R code. The most convenient approach is to include your answers in the exercises.Rmd file that you opened in RStudio during setup. Under each exercise, you can, on a new line, either type your textual answer in plain text, or create an R code chunk in which you write R code. Each R code chunk starts with ```{r} and ends with ```. In between those two, you can put R code. You can execute the code of each individual chunk by clicking the green arrow in the top-right of the chunk, or by pressing Ctrl+Shift+Enter. Alternatively, you can execute all chunks in the whole document by pressing Ctrl+Alt+R.
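For example, an answer chunk in exercises.Rmd could look like this (the code inside is just a placeholder):

```{r}
# A placeholder answer: compute a simple sum.
1 + 1
```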

Tip: In R Markdown files, you are not limited to chunks of R code, but you can also include chunks of, for example, Python and SQL code! See here.

Understand the basics of R Markdown

The great thing about R Markdown is that you can easily render your document into a readable format, such as a web page, a PDF document, or a Word document. The easiest way to do this is with the Knit button in the menu just above the file. This also makes it much nicer to share your work! To get a basic understanding of how to work with R Markdown, please read and/or watch the R Markdown Quick Tour before you proceed! Also, you can always look at the R Markdown Cheatsheet.

Tip: If you want to export your file to PDF, you need to have LaTeX. You don’t have to leave RStudio for this! Install the R package tinytex with install.packages("tinytex") and then run tinytex::install_tinytex() to install TinyTeX on your computer. This is only if you want to try it out; rendering to HTML is what we’ll do in this course.

Install packages

In the exercises, you will use several R packages. R packages contain (mainly) functions that extend the functionality of the base R language. They are written by developers who love to share their work and make your life easier! Packages can be installed from a central package repository called CRAN with install.packages("< package name >").

Any R developer who takes their work seriously will probably upload their package to CRAN at some point. However, packages that are in early development will not be in the CRAN repository yet. Often, they just live on GitHub. With the remotes package, you can install R packages directly from GitHub: remotes::install_github("< user/repo >").
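For example, installing sf straight from its GitHub repository would look like this (just as an illustration; sf is also on CRAN, so you don’t need to do this):

# Install the development version of sf from GitHub.
remotes::install_github("r-spatial/sf")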

Let’s now install the packages we will use (if you don’t already have them). Don’t worry, they are all on CRAN. Instead of calling install.packages() multiple times, we can provide it a single vector (created with c()) containing multiple package names.

install.packages(c(
  "tidyverse", # a meta-package that contains a set of packages designed for data science
  "sf", # for spatial vector data analysis
  "tmap", # for advanced thematic mapping
  "mapview", # for interactive mapping
  "ggplot2", # for fancy plots
  "plotly", # for interactive plots
  "shiny", # for creating a web-app
  "flexdashboard" # for including shiny in a dashboard
))

Installing a package means that the package code is now on your computer, but not that it is also loaded into your current R session. To do that, you have to run library(< package name >). You can also run a function from an installed package without loading the package explicitly, by running < package name >::< function name >(). Once you have called library(), there is no need anymore to include < package name >:: before a function call, but you might still want to do so such that others (including your future self) know exactly which package a used function belongs to.
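A small sketch of the two styles, using the sf package as an example:

# Call a single function without loading the package...
sf::st_point(c(0, 0))

# ...or load the package once and call its functions directly.
library(sf)
st_point(c(0, 0))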

Tip: A nice thing about R and its packages is that the documentation is always available. If you want to know how a function works, just type ? in front of any function in the console, and you will get the help page for it, e.g. ?install.packages.

Load data

We can load data directly from a URL by using the download.file() function. Here, we download the data only if it does not exist yet. That way, you will not waste time downloading a file that you already have!

# We will store the data in a folder called data.
# We only create this folder when it does not exist yet!
if (!dir.exists("data")) dir.create("data")

# Define where to get the data from, and where to store it.
src = "https://www.salzburg.gv.at/ogd/7b5df951-b8b0-4153-b599-490b4fe39d8b/Burgen_und_Schloesser_Shapefile.zip"
dst = "data/castles.zip"

# Download and unzip the file, if it does not exist yet.
if (!file.exists(dst)) {
  download.file(src, dst)
  unzip(dst, exdir = "data")
}

To read and manipulate vector data in R, we will use the sf package. First, we load the package into our current R session, such that we can use its functions directly.

library(sf)
## Linking to GEOS 3.7.1, GDAL 2.2.2, PROJ 4.9.2

You see that when loading the package, you get informed that it links to GEOS, GDAL and PROJ. These are the well-known geospatial workhorse libraries that are used by a lot of software (including ArcGIS, QGIS, PostGIS and the spatial packages of Python). GDAL can read, write and convert a wide variety of geospatial file formats, GEOS provides functions for two-dimensional geometry operations (e.g. intersections, unions, buffers), and PROJ takes care of coordinate reference systems. What sf does is bring the functionality of these libraries into R, in a clear, unified structure.

For data reading, this means that sf can read all data formats that are understood by GDAL. It also supports loading data from databases and directly from URLs. Reading data is now as simple as providing a file location to the read function, which will (try to) automatically determine the format and act accordingly.

# Read the downloaded shapefile.
castles = st_read("data/Burgen_und_Schloesser/Burgen_und_Schloesser.shp")
## Reading layer `Burgen_und_Schloesser' from data source `/home/luuk/Documents/maptime/maptimeR/data/Burgen_und_Schloesser/Burgen_und_Schloesser.shp' using driver `ESRI Shapefile'
## Simple feature collection with 104 features and 9 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 369853.9 ymin: 215126.1 xmax: 488768.8 ymax: 314745.1
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=0 +lon_0=13.33333333333333 +k=1 +x_0=450000 +y_0=-5000000 +ellps=bessel +units=m +no_defs

castles is now an sf object that lives in your R environment, which you can see in the top-right pane of RStudio.
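You can also check this from the console; class() is a base R function, and an sf object is a data.frame with spatial extras:

# An sf object is also a data.frame underneath.
class(castles)
## [1] "sf"         "data.frame"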

Note: If you are a Windows user, installing sf brings GDAL, GEOS and PROJ along. However, if you use Linux, you need to have these libraries installed on your system (we assume that there are no geospatial Linux users who don’t have this, though). See here for the sf installation guide.

In spatial data science, you unfortunately cannot get around dirty CSV files that you’ll have to join with your spatial data, so let’s also show how to load such a file into R. read.csv() does the job for you.

# Of course, governments love to put multi-line headers in CSV files (please don't!)
# Therefore, we have to read starting from the 3rd line. Hence, skip = 2.
population_sbg = read.csv("https://www.salzburg.gv.at/ogd/f3148c9c-4e14-4258-b00f-b7695b75d480/population-szg-sex-age-city.csv", skip = 2, sep = ";")

Again, population_sbg now lives in your R environment. This time not as a spatial sf object, but as a regular data.frame. The file contains information about the population of each municipality in the province of Salzburg, for several years.

Exercises

1. Run st_drivers() in the console to see all file formats supported by sf.

Explore data

Before analyzing data, it is very important to know what you’re dealing with! Therefore, you should always do a quick exploration of your data. How many observations and variables do I have? What is the CRS of the data? What does it look like on a map? Et cetera.

A core data structure in R is the so-called data frame, which is very similar to a database table. We call a data frame tidy if each row is an observation, and each column a variable. An sf object extends such a data frame, adding a special/spatial column, in which the geometry of each observation is stored. This geometry is called sticky, since it always sticks to the observation, no matter what you do. For example, if you select a specific variable, the geometry column will not be removed. Additionally, each sf object contains information about some spatial properties of the data, for example: the geometry type, the spatial extent (i.e. the bounding box), and the coordinate reference system.

(Illustration by Allison Horst)

The most common geometry types supported by sf are the following:

type                description
POINT               zero-dimensional geometry containing a single point
LINESTRING          sequence of points connected by straight, non-self-intersecting line pieces; one-dimensional geometry
POLYGON             geometry with a positive area (two-dimensional); sequence of points form a closed, non-self-intersecting ring; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring
MULTIPOINT          set of points; a MULTIPOINT is simple if no two points in the MULTIPOINT are equal
MULTILINESTRING     set of linestrings
MULTIPOLYGON        set of polygons
GEOMETRYCOLLECTION  set of geometries of any type except GEOMETRYCOLLECTION

(Source: sf documentation)

For PostGIS users, these types probably seem familiar. That is because sf implements a formal standard on how objects in the real world can be represented and stored in computers, and which geometrical operations should be defined for them. This standard is called Simple Feature Access. So, now you also know why sf is called sf!
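To make these types a bit less abstract: sf also provides constructor functions to build such geometries by hand from plain coordinates. A quick sketch (you will not need this for the exercises):

# Build simple geometries from coordinates.
pt = st_point(c(1, 2))                       # a POINT
ln = st_linestring(rbind(c(0, 0), c(1, 1)))  # a LINESTRING
pg = st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 0)))) # a POLYGON with one closed exterior ring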

Okay, enough theory. How do we access all this information, such that we can explore the data? The head() function is a good place to start. It shows you the first few rows of a data frame.

head(castles)
## Simple feature collection with 6 features and 9 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 428227.9 ymin: 260199.2 xmax: 432843 ymax: 295754.9
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=0 +lon_0=13.33333333333333 +k=1 +x_0=450000 +y_0=-5000000 +ellps=bessel +units=m +no_defs
##                Name                 Adresse HNR      Ort  PLZ Gemeinde
## 1      Schloss Anif Salzachtal-Bundesstraße   9     Anif 5081     Anif
## 2 Schloss Blühnbach         Blühnbachstraße  53  Tenneck 5451   Werfen
## 3     Schloss Aigen  Schwarzenbergpromenade  37 Salzburg 5026 Salzburg
## 4  Schloss Arenberg          Arenbergstraße  10 Salzburg 5020 Salzburg
## 5       Edmundsburg              Mönchsberg   2 Salzburg 5020 Salzburg
## 6   Schloss Emsburg       Hellbrunner Allee  52 Salzburg 5020 Salzburg
##   GEMNR                                                            DBLink
## 1 50301             http://www.burgen-austria.com/Archiv.asp?Artikel=Anif
## 2 50424        http://www.burgen-austria.com/Archiv.asp?Artikel=Blühnbach
## 3 50101       http://www.salzburg.gv.at/themen/ks/kultur/burgen/aigen.htm
## 4 50101    http://www.salzburg.gv.at/themen/ks/kultur/burgen/arenberg.htm
## 5 50101 http://www.salzburg.gv.at/themen/ks/kultur/burgen/edmundsburg.htm
## 6 50101     http://www.salzburg.gv.at/themen/ks/kultur/burgen/emsburg.htm
##   PDFLink                  geometry
## 1    <NA> POINT (430134.6 289692.9)
## 2    <NA>   POINT (432843 260199.2)
## 3    <NA>   POINT (431729 294215.6)
## 4    <NA> POINT (429486.6 295754.9)
## 5    <NA> POINT (428227.9 295454.4)
## 6    <NA> POINT (429678.7 292559.1)

You can retrieve the total number of rows and columns by running dim().

dim(castles)
## [1] 104  10

These are both basic R functions. To access the spatial properties of an sf object, you’ll need specific functions from the sf package. For example, st_crs() shows you information on the coordinate reference system. Similarly, st_bbox() shows the spatial extent of the data.

st_crs(castles)
## Coordinate Reference System:
##   No EPSG code
##   proj4string: "+proj=tmerc +lat_0=0 +lon_0=13.33333333333333 +k=1 +x_0=450000 +y_0=-5000000 +ellps=bessel +units=m +no_defs"
st_bbox(castles)
##     xmin     ymin     xmax     ymax 
## 369853.9 215126.1 488768.8 314745.1

You can extract only the geometry of the data (without all attributes) with the st_geometry() function. This is helpful when you just want to quickly see how the data points are aligned in space. You do so by plotting the geometry of the data.

plot(st_geometry(castles))

However, nowadays it is just as easy to create a quick interactive map of your data and explore it interactively. The mapview package is a great option for doing this in just one line of code. Click on a point and check the attributes!

library(mapview)
mapview(castles)

Exercises

1. Explore the other datasets you loaded before. What do they have in common? What are the differences?

  • HINT: Don’t include the interactive map for the municipalities in the .Rmd, but run the code just in your console instead. Such a big map (more than 2000 detailed polygons) will make your .Rmd very large and slow to render.

2. What are challenges you identify for merging/joining all data we have loaded so far?

3. Test different background layers for mapview interactive maps (see the sketch below).
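A starting point for exercise 3: mapview selects background layers through its map.types argument, which accepts names of leaflet provider tiles (the one below is a common choice, but other provider names work too):

# Sketch: use a satellite image as background layer.
mapview(castles, map.types = "Esri.WorldImagery")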

Wrangle data

Now we have reached the fun part, where the real analysis starts! By data wrangling we mean cleaning, filtering, transforming and joining datasets such that we get the results we want. In R, there exists a set of packages, known as the tidyverse, meant to make your data wrangling life easier and more intuitive. All these packages share an underlying design philosophy, and therefore interact well with each other. You can load all of the core tidyverse packages at once by running library(tidyverse). You will see that eight packages are loaded at once, and that two base R functions are masked by tidyverse functions with the same name.

library(tidyverse)
## ── Attaching packages ─────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1          ✔ purrr   0.3.3     
## ✔ tibble  2.1.3          ✔ dplyr   0.8.3     
## ✔ tidyr   1.0.0.9000     ✔ stringr 1.4.0     
## ✔ readr   1.3.1          ✔ forcats 0.4.0
## ── Conflicts ────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

The main data wrangling workhorse of the tidyverse is the dplyr package. The most important functions of that package are:

function     description
filter()     selecting specific observations, i.e. rows
select()     selecting specific variables, i.e. columns
mutate()     adding a new variable, possibly as a function of other variables
group_by()   splitting a dataset in groups, based on a variable
summarise()  summarising a dataset by applying a function to a variable, e.g. mean or sum
*_join()     joining two datasets based on a variable; can be a left_join, inner_join, et cetera

Iconic to the tidyverse design is the so-called pipe operator: %>%. The pipe is meant to make your workflow of applying functions to data more intuitive. Normally, you read code (in R, but also in Python and other languages) from the inside out. That is, if you want to select some observations from your data, then select a specific variable, and then summarise by calculating the average value of this variable for your selected set of observations, you would write this something like:

summarise(select(filter(data)))

The pipe operator, however, forwards the result of one function as the first argument to the next function. That way, you would write the same workflow like:

data %>% filter %>% select %>% summarise
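To make this concrete, here is a tiny sketch using the built-in mtcars dataset; both versions do exactly the same:

# Nested: read from the inside out.
summarise(select(filter(mtcars, cyl == 4), mpg), mean_mpg = mean(mpg))

# Piped: read from left to right.
mtcars %>%
  filter(cyl == 4) %>%
  select(mpg) %>%
  summarise(mean_mpg = mean(mpg))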

Remember that there is always a cheatsheet to help you out when you are lost: dplyr cheatsheet.

NOTE: The tidyverse packages are an extension of base R, not a replacement! Almost everything you do with the tidyverse, you can also do without loading any packages. The tidyverse just makes your work easier to structure and easier to understand. At least, that is what we think.

Now, let’s showcase the functionality of all core dplyr functions and the pipe. Remember that our dataset population_sbg contains information about the population of each municipality in the province of Salzburg, for several years. Each municipality has two observations (i.e. rows) per year: one for the female population, one for the male population. The most recent year in the data is 2015. Remember also that you loaded the shapefile of municipalities in Austria in a previous exercise. So, let’s formulate the following goal: we want to have a spatial dataset (i.e. including geometry) of all municipalities in the province of Salzburg, containing a variable that describes the total population (male plus female) in each municipality.

First, we filter the observations of the population data such that we only keep the most recent year. Then, we select only the variables of interest: CITY_CODE, the unique id of a municipality; SEX, describing whether the observation regards males or females; and POP_TOTAL, the total population observed.

population_sbg_2015 = population_sbg %>%
  filter(REF_YEAR == 2015) %>%
  select(CITY_CODE, SEX, POP_TOTAL)

head(population_sbg_2015)
##   CITY_CODE SEX POP_TOTAL
## 1     50101   1     70385
## 2     50101   2     78035
## 3     50201   1      2897
## 4     50201   2      2904
## 5     50202   1      1763
## 6     50202   2      1726

Now, we still have two lines per municipality (i.e. one for male, one for female), which we want to add up. That is: we group the data by municipality (column CITY_CODE), and then for each group, summarise the POP_TOTAL column by calculating the sum of its values.

total_population_sbg_2015 = population_sbg_2015 %>%
  group_by(CITY_CODE) %>% 
  summarise(population = sum(POP_TOTAL))

head(total_population_sbg_2015)
## # A tibble: 6 x 2
##   CITY_CODE population
##       <int>      <int>
## 1     50101     148420
## 2     50201       5801
## 3     50202       3489
## 4     50203       2236
## 5     50204       4221
## 6     50205      20607

Great! That looks useful. Now we have to somehow add the geometries of the municipalities. These are stored in the municipalities dataset, where the equivalent of the CITY_CODE column is called id. However, there is one last technical issue we have to solve before we can join the two. We are dealing with governmental data, so of course, it is messy. The data type of CITY_CODE in total_population_sbg_2015 is not equal to the data type of id in municipalities. You do not have to worry about understanding data types in detail here, but you do have to know that we have to make them identical before joining. Therefore, we will create a new column in total_population_sbg_2015 which overwrites the old CITY_CODE, keeping the same values, but changing the data type.

total_population_sbg_2015 = total_population_sbg_2015 %>%
  mutate(CITY_CODE = as.factor(CITY_CODE))

Finally, time to join! The most used joins are a left_join and an inner_join.

type        description
inner_join  return all rows from x where there are matching values in y, and all columns from x and y
left_join   return all rows from x, and all columns from x and y; rows in x with no match in y will have NA (i.e. ‘not available’) values in the new columns

(Source: dplyr documentation)

For our case, we want to keep only the municipalities in Salzburg from the Austria-wide municipalities dataset, and add the population information to them. That is, we want to return all rows from municipalities where there are matching values in total_population_sbg_2015. So that is an… inner_join!

Note: A left_join(total_population_sbg_2015, municipalities) would in this case have given the same result (think about why!), but it would have kept the type of the population dataset, which is a data.frame and not a spatial sf object. Hence, you would have had to convert the result to an sf object afterwards.
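If you had taken that route, the fix would look something like the sketch below; st_as_sf() turns a data.frame that contains a geometry column back into an sf object:

# Reversed join: the result is a plain data.frame, so convert it back to sf.
total_population_sbg_2015 %>%
  left_join(municipalities, by = c('CITY_CODE' = 'id')) %>%
  st_as_sf()

With the inner join, no conversion is needed: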

municipalities_sbg = municipalities %>%
  inner_join(total_population_sbg_2015, by = c('id' = 'CITY_CODE'))

head(municipalities_sbg)
## Simple feature collection with 6 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 374015 ymin: 395438.6 xmax: 413827.4 ymax: 439501.5
## epsg (SRID):    31287
## proj4string:    +proj=lcc +lat_1=48.99999999999999 +lat_2=46 +lat_0=47.5 +lon_0=13.33333333333334 +x_0=400000 +y_0=400000 +ellps=bessel +towgs84=601.705,84.263,485.227,4.7354,-1.3145,-5.393,-2.3887 +units=m +no_defs
##      id                   name population                       geometry
## 1 50101               Salzburg     148420 MULTIPOLYGON (((380697.9 43...
## 2 50201                Abtenau       5801 MULTIPOLYGON (((406462.1 40...
## 3 50202                  Adnet       3489 MULTIPOLYGON (((386576.8 41...
## 4 50203       Annaberg-Lungötz       2236 MULTIPOLYGON (((406416.4 40...
## 5 50204 Golling an der Salzach       4221 MULTIPOLYGON (((387546.6 40...
## 6 50205                Hallein      20607 MULTIPOLYGON (((384067.8 41...
mapview(municipalities_sbg)

This seems to have worked very well! Now that you know how to use the most important dplyr functions, it is time to move them to the spatial domain. After all, the join we just did was based on an id that was present in both datasets, not on any explicit spatial properties.

sf is full of functions to wrangle spatial data. You may have already noticed that most of these functions start with st_. Again, if you are a PostGIS user, this will look familiar. Not a coincidence: remember that sf is largely based on official standards, which also define valid geometrical operations. Some examples:

st_transform() will transform the data into a different CRS:

# Castles to EPSG:3035
castles %>%
  st_transform(crs = 3035) %>%
  st_crs()
## Coordinate Reference System:
##   EPSG: 3035 
##   proj4string: "+proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
# You can also use the CRS of another dataset.
castles %>%
  st_transform(crs = st_crs(municipalities)) %>%
  st_crs()
## Coordinate Reference System:
##   EPSG: 31287 
##   proj4string: "+proj=lcc +lat_1=48.99999999999999 +lat_2=46 +lat_0=47.5 +lon_0=13.33333333333334 +x_0=400000 +y_0=400000 +ellps=bessel +towgs84=601.705,84.263,485.227,4.7354,-1.3145,-5.393,-2.3887 +units=m +no_defs"

st_area() will calculate the area of a geometry:

# Calculate the area of the municipality of Salzburg city.
municipalities_sbg %>%
  filter(name == "Salzburg") %>%
  st_area()
## 65641400 [m^2]

st_buffer() draws a buffer around a geometry:

# Draw a buffer of 500m around the Festung Hohensalzburg.
castles %>%
  filter(Name == "Festung Hohensalzburg") %>%
  st_buffer(500) %>%
  mapview()

st_union() merges several geometries into one:

# Merge all municipalities in the Salzburg province.
municipalities_sbg %>%
  st_union() %>%
  mapview()

Hopefully you have noticed by now that sf functions (and also mapview) work really well together with dplyr and pipes. Some dplyr functions work slightly differently on sf objects, because of the sticky geometry we talked about before. For example, selecting a variable will always keep the geometry along.

# Here, we select only Name, but the geometry sticks with it.
castles %>%
  select(Name)
## Simple feature collection with 104 features and 1 field
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 369853.9 ymin: 215126.1 xmax: 488768.8 ymax: 314745.1
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=0 +lon_0=13.33333333333333 +k=1 +x_0=450000 +y_0=-5000000 +ellps=bessel +units=m +no_defs
## First 10 features:
##                                  Name                  geometry
## 1                        Schloss Anif POINT (430134.6 289692.9)
## 2                   Schloss Blühnbach   POINT (432843 260199.2)
## 3                       Schloss Aigen   POINT (431729 294215.6)
## 4                    Schloss Arenberg POINT (429486.6 295754.9)
## 5                         Edmundsburg POINT (428227.9 295454.4)
## 6                     Schloss Emsburg POINT (429678.7 292559.1)
## 7                   Schloss Frohnburg POINT (429652.7 293224.2)
## 8                    Schloss Freisaal POINT (429378.2 294304.9)
## 9                   Schloss Hellbrunn   POINT (429582.8 291547)
## 10 Schloss Hellbrunn - Monatsschlössl POINT (429860.5 291272.8)

Summarising a variable will also summarise the geometry. That is, when you summarise a variable for a set of points, the POINT geometries of all these points will be merged together into a single MULTIPOINT geometry. See it for yourself:

# Count all the castles in the data. Counting is done with the function n().
castles %>%
  summarise(total_count = n())
## Simple feature collection with 1 feature and 1 field
## geometry type:  MULTIPOINT
## dimension:      XY
## bbox:           xmin: 369853.9 ymin: 215126.1 xmax: 488768.8 ymax: 314745.1
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=0 +lon_0=13.33333333333333 +k=1 +x_0=450000 +y_0=-5000000 +ellps=bessel +units=m +no_defs
##   total_count                       geometry
## 1         104 MULTIPOINT (369853.9 235736...

Here, you are probably not so interested in this MULTIPOINT geometry. No worries, you can easily get rid of the geometry by setting it to NULL. Of course, the data will then lose its spatial properties.

castles %>%
  summarise(total_count = n()) %>%
  st_set_geometry(NULL)
##   total_count
## 1         104

Finally, the join functionality also has a spatial counterpart: st_join(). To understand it, we first have to introduce something called spatial predicates. These are functions that accept two geometries and apply a test to them that returns either TRUE or FALSE. Too technical? Let’s just look at an example.

The spatial predicate st_within() will tell you if geometry a is fully inside geometry b: is the Festung Hohensalzburg inside the municipality of Salzburg?

# Don't forget to put them in exactly the same CRS!
festung = castles %>% filter(Name == "Festung Hohensalzburg") %>% st_transform(st_crs(municipalities))
sbg_city = municipalities_sbg %>% filter(name == "Salzburg")

st_within(festung, sbg_city)
## Sparse geometry binary predicate list of length 1, where the predicate was `within'
##  1: 1

Answer: yes! (Hint: 1 stands for TRUE.) See here for a description of all available spatial predicates.
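A few other predicates you can try on the same pair of geometries (all of these are sf functions; a quick sketch):

# Do the two geometries share any space at all?
st_intersects(festung, sbg_city)
# The inverse question: does the municipality contain the castle?
st_contains(sbg_city, festung)
# Are the two geometries completely separated?
st_disjoint(festung, sbg_city)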

Now what does this have to do with joins? We can use these spatial predicates to spatially join information from two spatial datasets! An example: we want to join festung with municipalities_sbg such that festung will include information about the municipality in which it is located.

festung %>%
  st_join(municipalities_sbg, join = st_within)
## Simple feature collection with 1 feature and 12 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 378648 ymin: 432897.7 xmax: 378648 ymax: 432897.7
## epsg (SRID):    31287
## proj4string:    +proj=lcc +lat_1=48.99999999999999 +lat_2=46 +lat_0=47.5 +lon_0=13.33333333333334 +x_0=400000 +y_0=400000 +ellps=bessel +towgs84=601.705,84.263,485.227,4.7354,-1.3145,-5.393,-2.3887 +units=m +no_defs
##                    Name    Adresse HNR      Ort  PLZ Gemeinde GEMNR
## 1 Festung Hohensalzburg Mönchsberg  34 Salzburg 5020 Salzburg 50101
##                                                          DBLink PDFLink
## 1 http://www.salzburg.gv.at/themen/ks/kultur/burgen/festung.htm    <NA>
##      id     name population                geometry
## 1 50101 Salzburg     148420 POINT (378648 432897.7)

NOTE: By default, st_join() does a left join. When you add left = FALSE, it will do an inner join.
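For example, a minimal sketch of an inner spatial join on the full castles dataset, which keeps only the castles that fall within one of the Salzburg municipalities:

# Inner spatial join: castles outside all municipalities are dropped.
castles %>%
  st_transform(st_crs(municipalities_sbg)) %>%
  st_join(municipalities_sbg, join = st_within, left = FALSE)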

So, that was a lot of reading… Time to do something practical. Oh, and also for sf there is a cheatsheet.

Exercises

1. Wrangle the available data in such a way that you end up with a dataset attractions_sbg, which is a copy of municipalities_sbg, but with a few extra columns:

  • castles describing the number of castles in that municipality

  • museums describing the number of museums in that municipality

  • attractions being the sum of the number of castles and museums

  • per_sqm being the number of attractions per square meter

  • per_person being the number of attractions per inhabitant

  • HINT: with mutate(columnname = replace_na(columnname, 0)) you replace all NA values in a column with 0. You will need it! (See the toy sketch below if replace_na() is new to you.)
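replace_na() comes from the tidyr package, part of the tidyverse, so it is already loaded. A toy sketch of what it does:

# Toy example: the NA in column x is replaced by 0.
tibble(x = c(1, NA, 3)) %>%
  mutate(x = replace_na(x, 0))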

Bonus

Are you already experienced in R and is this much too easy for you? Then also answer these questions.

1. Which castles are a museum?

2. Filter the Salzburg province municipalities from the Austrian municipalities, by using the Provinces dataset, instead of the population CSV we used in the examples.

Visualize data

Visualizing data is a very important part of spatial data analysis. And this includes not only maps! For example, we can compare the area and population of municipalities in a scatterplot. The easiest way is a simple base R plot:

# This is base R plotting. A simple scatterplot can be created with one line.
plot(area ~ population, data = attractions_sbg)

Very easy, but you will probably not impress your friends. ggplot2 is a tidyverse package that makes plotting fancier. It works with layers that you add together with +. In our example, we first initialize the plot (Which dataset? Which variables on the x and y axes?), and then add scatter points as a layer.

NOTE: You may have noticed that the area column has a unit associated with it. This is done by the units package, which is imported by sf. Unfortunately, ggplot2 is not fancy enough to deal with this, so we’ll have to drop the units first.

# A more 'tidy' way to plot, with very nice results is ggplot2.
scatter = attractions_sbg %>%
  mutate(area = units::drop_units(area)) %>%
  ggplot(aes(x = population, y = area)) +
  geom_point()

scatter

As you can see, ggplot2 works very well in combination with pipes. Also, you noticed of course that there is one point that makes our plot a little ugly (which municipality would that be…). You could remove this point, but the nice thing about ggplot2 is that you can also provide your plot to the plotly library, which will magically turn it into an interactive plot in just one line of code! Then you can zoom in to any area of your plot.

library(plotly)
ggplotly(scatter)

Back to mapping. We already learned how to make fast interactive maps with mapview. Static thematic maps can be made in ggplot2 (there is the geom_sf() layer for this; try it if you feel adventurous!), but tmap is a package that is even better suited for this task. Just like ggplot2, tmap works with different layers. First, you create a shape, then you decide how to visualize that shape. You can add several shapes on top of each other if you want! We cheat a little bit with the legend breaks such that this one annoying municipality does not bother us once more.

library(tmap)
map = tm_shape(attractions_sbg) +
  tm_polygons(col = 'population', style = 'jenks', palette = 'viridis') 

map

Exercises

1. Plot the same scatter plot again without the annoying municipality.

2. Explore relations between other variables with either a base R plot, or an (interactive) ggplot. Don’t be scared to also play around with colors or other style attributes. Find the ggplot cheatsheet here.

3. First, run tmap_mode('view'), and then create the same thematic map as in the example. What happens?

Write data

One thing we have not covered yet is exporting your data. This is made simple with st_write(). Let’s export our result as GeoJSON.

# When the file already exists, force an overwrite.
st_write(attractions_sbg, 'result.geojson', delete_dsn = TRUE)
## Deleting source `result.geojson' using driver `GeoJSON'
## Writing layer `result' to data source `result.geojson' using driver `GeoJSON'
## Writing 119 features with 9 fields and geometry type Multi Polygon.

Shiny your data!

Feeling fancy? Open the file app.Rmd (which lives in the GitHub repository) and click Run Document. Yes, you can also render an R Markdown file into a web app! With not too much code, we created a small interactive web app of the data. Nothing too fancy, but not too hard and scary either! If you still have time and feel adventurous, try to pimp the app. See here for some documentation.