First things first. To complete the exercises, you first have to install the R base system and RStudio Desktop. RStudio is an integrated development environment (IDE) for R, which makes it much easier to manage your R code. If you are in doubt about something, don’t hesitate to look at the RStudio Cheatsheet.
Tip: Cheatsheets are a very useful resource to help you out. Find here a wide variety of cheatsheets for all kinds of R applications.
You can find all course materials in the maptimeR repository on GitHub. The easiest way to start with the exercises is to clone this repository (of course, you can also fork it first, so that you can push your results to your own copy of the repository afterwards). If you don’t have Git, or don’t want to use it, you can manually download the files exercises.Rmd and app.Rmd and store them in a new directory named maptimeR.
Now, you can open RStudio! The most convenient way to work on a specific task in RStudio is to create a project for it. Each project has its own working directory, workspace, history, and source documents. If you cloned the git repository, a project file is already in there, and you can simply open it by clicking File -> Open Project. If you only downloaded the exercises, click instead on File -> New Project, choose Existing Directory, and select the maptimeR directory in which you stored the exercises.
In the bottom right square of your screen, under the header Files, you will see a list of all files that are part of your project’s directory. One of them is exercises.Rmd. The file extension .Rmd stands for R Markdown. Markdown itself is a simple formatting language in which you write your text in plain text rather than in complicated code like HTML or LaTeX. R Markdown enables you to embed ‘chunks’ of R code in such Markdown files, which makes it well suited for creating dynamic documents with R. To work with it, we do need to install the rmarkdown package. Do so by running install.packages("rmarkdown") in the console pane, which you see in the bottom left square of your screen.
Now, you can open exercises.Rmd by clicking on it. If you have a strong aversion against clicking, you can also run file.edit("exercises.Rmd") in the console pane.
The exercises are divided into several chapters. Each chapter provides some explanation, accompanied by examples, and ends with a set of hands-on exercises. Some of these exercises can be answered with just text, but most of them require you to write R code. The most convenient way is to include your answers in the exercises.Rmd file that you opened in RStudio during setup. Under each exercise, you can - on a new line - either type your textual answer in plain text, or create an R code chunk in which you write R code. Each R code chunk starts with ```{r} and ends with ```. In between those two markers, you put R code. You can execute the code of each individual chunk by clicking the green arrow in the top-right of the chunk, or by pressing Ctrl+Shift+Enter. Alternatively, you can execute all chunks in the whole document by pressing Ctrl+Alt+R.
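For example, a minimal piece of an R Markdown file with one embedded code chunk could look like this (a sketch; the text and the code are placeholders):

````markdown
Some explanatory text, written in plain Markdown.

```{r}
# This R code runs when the chunk is executed,
# and its result appears in the rendered document.
x <- c(1, 2, 3)
mean(x)
```
````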
Tip: In R Markdown files, you are not limited to chunks of R code, but you can also include chunks of, for example, Python and SQL code! See here.
The great thing about R Markdown is that you can easily render your document in a readable format, such as a web page, a PDF document, or a Word document. The easiest way to do this is with the Knit button in the menu just above the file. This also makes it much nicer to share your work! To get a basic understanding of how to work with R Markdown, please read and/or watch the R Markdown Quick Tour before you proceed! Also, you can always look at the R Markdown Cheatsheet.
Tip: If you want to export your file to PDF, you need to have LaTeX installed. You don’t have to leave RStudio for this! Install the R package tinytex with install.packages("tinytex") and then run tinytex::install_tinytex() to install TinyTeX on your computer. This is only if you want to try it out; rendering to HTML is what we’ll do in this course.
In the exercises, you will use several R packages. R packages contain (mainly) functions that extend the functionality of the base R language. They are written by developers who love to share their work and make your life easier! Packages can be installed from a central package repository called CRAN with install.packages("< package name >").
Any R developer who takes their work seriously will probably upload their package to CRAN at some point. However, packages that are in early development will not be in the CRAN repository yet. Often, they just live on GitHub. With the remotes package, you can install R packages directly from GitHub: remotes::install_github("< user/repo >").
Let’s now install the packages we will use (if you don’t already have them). Don’t worry, they are all on CRAN. Instead of calling install.packages() multiple times, we can provide it a single vector (created with c()) containing multiple package names.
install.packages(c(
"tidyverse", # a meta-package that contains a set of packages designed for data science
"sf", # for spatial vector data analysis
"tmap", # for advanced thematic mapping
"mapview", # for interactive mapping
"ggplot2", # for fancy plots
"plotly", # for interactive plots
"shiny", # for creating a web-app
"flexdashboard" # for including shiny in a dashboard
))
Installing a package means that the package code is now on your computer, but not that it is also loaded into your current R session. To do that, you have to run library(< package name >). You can also run a function from an installed package without loading the package explicitly, by running < package name >::< function name >(). Once you have used library(), there is no need anymore to include < package name >:: before a function call, but you might still want to do so, such that others (including your future self) know exactly to which package a used function belongs.
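As a quick sketch of the two calling styles, using the stats package that ships with base R (so there is nothing extra to install):

```r
# Explicit namespace prefix: no library() call needed.
a <- stats::median(c(1, 5, 9))

# After library(), the prefix becomes optional.
library(stats)
b <- median(c(1, 5, 9))

a == b  # TRUE: both calls run the same function
```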
Tip: A nice thing about R and its packages is that the documentation is always available. If you want to know how a function works, just type ? in front of any function name in the console, and you will get its help page, e.g. ?install.packages.
We can load data directly from a URL, using the download.file() function. Here, we download the data only if it does not exist yet. Like that, you will not waste time downloading a file that you already have!
# We will store the data in a folder called data.
# We only create this folder when it does not exist yet!
if (!dir.exists("data")) dir.create("data")
# Define where to get the data from, and where to store it.
src = "https://www.salzburg.gv.at/ogd/7b5df951-b8b0-4153-b599-490b4fe39d8b/Burgen_und_Schloesser_Shapefile.zip"
dst = "data/castles.zip"
# Download and unzip the file, if it does not exist yet.
if (!file.exists(dst)) {
download.file(src, dst)
unzip(dst, exdir = "data")
}
To read and manipulate vector data in R, we will use the sf package. First, we load the package into our current R session, such that we can use its functions directly.
library(sf)
## Linking to GEOS 3.7.1, GDAL 2.2.2, PROJ 4.9.2
You see that when loading the package, you get informed that it links to GEOS, GDAL and PROJ. These are the well-known geospatial workhorse libraries that are used by a lot of software (including ArcGIS, QGIS, PostGIS and the spatial packages of Python). GDAL can read, write and convert a wide variety of geospatial file formats, GEOS provides functions for two-dimensional geometry operations (e.g. intersections, unions, buffers), and PROJ takes care of coordinate reference systems. What sf does is bring the functionalities of these libraries into R, in a clear, unified structure.
For data reading, this means that sf can read all data formats that are understood by GDAL. It also supports loading data from databases and directly from URLs. Reading data is now as simple as providing a file location to the read function, which will (try to) automatically determine the format and act accordingly.
# Read the downloaded shapefile.
castles = st_read("data/Burgen_und_Schloesser/Burgen_und_Schloesser.shp")
## Reading layer `Burgen_und_Schloesser' from data source `/home/luuk/Documents/maptime/maptimeR/data/Burgen_und_Schloesser/Burgen_und_Schloesser.shp' using driver `ESRI Shapefile'
## Simple feature collection with 104 features and 9 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 369853.9 ymin: 215126.1 xmax: 488768.8 ymax: 314745.1
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=0 +lon_0=13.33333333333333 +k=1 +x_0=450000 +y_0=-5000000 +ellps=bessel +units=m +no_defs
castles is now an sf object that lives in your R environment, which you can see in the top-right pane of RStudio.
Note: If you are a Windows user, installing sf brings GDAL, GEOS and PROJ along. However, if you use Linux, you have to have these libraries installed on your system (we assume that there are no geospatial Linux users who don’t have this, though). See here for the sf installation guide.
In spatial data science, you unfortunately cannot get around dirty CSV files that you’ll have to join with your spatial data, so let’s also show how to load one of those into R. read.csv() does the job for you.
# Of course, governments love to put multi-line headers in CSV files (please don't!)
# Therefore, we have to read starting from the 3rd line. Hence, skip = 2.
population_sbg = read.csv("https://www.salzburg.gv.at/ogd/f3148c9c-4e14-4258-b00f-b7695b75d480/population-szg-sex-age-city.csv", skip = 2, sep = ";")
Again, population_sbg now lives in your R environment. This time not as a spatial sf object, but as a regular data.frame. The file contains information about the population of each municipality in the province of Salzburg, for several years.
Tip: Run st_drivers() in the console to see all file formats supported by sf.

Before analyzing data, it is very important to know what you’re dealing with! Therefore, you should always do a quick exploration of your data. How many observations and variables do I have? What is the CRS of the data? What does it look like on a map? Et cetera.
A core data structure in R is the so-called data frame, which is very similar to a database table. We call a data frame tidy if each row is an observation and each column a variable. An sf object extends such a data frame, adding a special/spatial column, in which the geometry of each observation is stored. This geometry is called sticky, since it always sticks to the observation, no matter what you do. For example, if you select a specific variable, the geometry column will not be removed. Additionally, each sf object contains information about some spatial properties of the data, for example: the geometry type, the spatial extent (i.e. the bounding box), and the coordinate reference system.
(Illustration by Allison Horst)
The most common geometry types supported by sf are the following:

type | description |
---|---|
POINT | zero-dimensional geometry containing a single point |
LINESTRING | sequence of points connected by straight, non-self intersecting line pieces; one-dimensional geometry |
POLYGON | geometry with a positive area (two-dimensional); sequences of points form closed, non-self intersecting rings; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring |
MULTIPOINT | set of points; a MULTIPOINT is simple if no two points in the MULTIPOINT are equal |
MULTILINESTRING | set of linestrings |
MULTIPOLYGON | set of polygons |
GEOMETRYCOLLECTION | set of geometries of any type except GEOMETRYCOLLECTION |
(Source: sf documentation)
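If you want to get a feel for these types, sf lets you build single geometries by hand with constructor functions named after them (a small sketch; the coordinates are made up):

```r
library(sf)

# A point from a single coordinate pair.
pt <- st_point(c(13.05, 47.81))

# A linestring from a matrix of coordinates, one row per point.
ln <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))

# A polygon from a list of rings; the first ring is the closed exterior.
pg <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))

st_geometry_type(pt)
st_geometry_type(pg)
```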
For PostGIS users, these types probably seem familiar. That is because sf implements a formal standard on how objects in the real world can be represented and stored in computers, and which geometrical operations should be defined for them. This standard is called Simple Feature Access. So, now you also know why sf is called sf!
Okay, enough theory. How do we access all this information, such that we can explore the data? The head() function is a good place to start. It shows you the first few rows of a data frame.
head(castles)
## Simple feature collection with 6 features and 9 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 428227.9 ymin: 260199.2 xmax: 432843 ymax: 295754.9
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=0 +lon_0=13.33333333333333 +k=1 +x_0=450000 +y_0=-5000000 +ellps=bessel +units=m +no_defs
## Name Adresse HNR Ort PLZ Gemeinde
## 1 Schloss Anif Salzachtal-Bundesstraße 9 Anif 5081 Anif
## 2 Schloss Blühnbach Blühnbachstraße 53 Tenneck 5451 Werfen
## 3 Schloss Aigen Schwarzenbergpromenade 37 Salzburg 5026 Salzburg
## 4 Schloss Arenberg Arenbergstraße 10 Salzburg 5020 Salzburg
## 5 Edmundsburg Mönchsberg 2 Salzburg 5020 Salzburg
## 6 Schloss Emsburg Hellbrunner Allee 52 Salzburg 5020 Salzburg
## GEMNR DBLink
## 1 50301 http://www.burgen-austria.com/Archiv.asp?Artikel=Anif
## 2 50424 http://www.burgen-austria.com/Archiv.asp?Artikel=Blühnbach
## 3 50101 http://www.salzburg.gv.at/themen/ks/kultur/burgen/aigen.htm
## 4 50101 http://www.salzburg.gv.at/themen/ks/kultur/burgen/arenberg.htm
## 5 50101 http://www.salzburg.gv.at/themen/ks/kultur/burgen/edmundsburg.htm
## 6 50101 http://www.salzburg.gv.at/themen/ks/kultur/burgen/emsburg.htm
## PDFLink geometry
## 1 <NA> POINT (430134.6 289692.9)
## 2 <NA> POINT (432843 260199.2)
## 3 <NA> POINT (431729 294215.6)
## 4 <NA> POINT (429486.6 295754.9)
## 5 <NA> POINT (428227.9 295454.4)
## 6 <NA> POINT (429678.7 292559.1)
You can retrieve the total number of rows and columns by running dim().
dim(castles)
## [1] 104 10
These are both basic R functions. To access the spatial properties of an sf object, you’ll need specific functions from the sf package. For example, st_crs() shows you information on the coordinate reference system. Similarly, st_bbox() shows the spatial extent of the data.
st_crs(castles)
## Coordinate Reference System:
## No EPSG code
## proj4string: "+proj=tmerc +lat_0=0 +lon_0=13.33333333333333 +k=1 +x_0=450000 +y_0=-5000000 +ellps=bessel +units=m +no_defs"
st_bbox(castles)
## xmin ymin xmax ymax
## 369853.9 215126.1 488768.8 314745.1
You can extract only the geometry of the data (without all attributes) with the st_geometry() function. This is helpful when you just want to quickly see how the data points are distributed in space. You do so by plotting the geometry of the data.
plot(st_geometry(castles))
However, nowadays it is just as easy to create a quick interactive map of your data and explore it interactively. The mapview package is a great option for doing this in just one line of code. Click on a point and check the attributes!
library(mapview)
mapview(castles)
Now we have reached the fun part, where the real analysis starts! With data wrangling we mean cleaning, filtering, transforming and joining datasets such that we get the results we want. In R, there exists a set of packages, known as the tidyverse, meant to make your data wrangling life easier and more intuitive. All these packages share an underlying design philosophy, and therefore interact well with each other. You can load all of the core tidyverse packages at once by running library(tidyverse). You will see that eight packages are loaded at once, and that two base R functions are overwritten by tidyverse functions with the same name.
library(tidyverse)
## ── Attaching packages ─────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0.9000 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The main data wrangling workhorse of the tidyverse is the dplyr package. The most important functions of that package are:
function | description |
---|---|
filter() | selecting specific observations, i.e. rows |
select() | selecting specific variables, i.e. columns |
mutate() | adding a new variable, possibly as a function of other variables |
group_by() | splitting a dataset in groups, based on a variable |
summarise() | summarising a dataset by applying a function to a variable, e.g. mean, sum, ... |
*_join() | joining two datasets based on a variable; can be a left_join, inner_join, et cetera |
Iconic for the tidyverse design is the so-called pipe operator: %>%. The pipe is meant to make your workflow of applying functions to data more intuitive. Normally, you read code (in R, but also in Python and other languages) from the inside to the outside. That is, if you want to select some observations from your data, then select a specific variable, and then summarise by calculating the average value of this variable for your selected set of observations, you would write this something like:
summarise(select(filter(data)))
The pipe command, however, forwards the result of one function as first argument to the next function. In that way, you would write the same workflow like:
data %>% filter %>% select %>% summarise
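As a concrete sketch with the built-in mtcars dataset (assuming the tidyverse, or at least dplyr, is loaded), both styles produce exactly the same result:

```r
library(dplyr)

# Nested: read from the inside out.
summarise(select(filter(mtcars, cyl == 4), mpg), mean_mpg = mean(mpg))

# Piped: read from left to right, step by step.
mtcars %>%
  filter(cyl == 4) %>%
  select(mpg) %>%
  summarise(mean_mpg = mean(mpg))
```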
Remember that there is always a cheatsheet to help you out when you are lost: the dplyr cheatsheet.
NOTE: The tidyverse packages are an extension of base R, not a replacement! Almost everything you do with the tidyverse, you can also do without loading any packages. The tidyverse just makes your work easier to structure and easier to understand. At least, that is what we think.
Now, let’s showcase the functionality of all core dplyr functions and the pipe. Remember that our dataset population_sbg contains information about the population of each municipality in the province of Salzburg, for several years. Each municipality has two observations (i.e. rows) per year: one for the female population, one for the male population. The most recent year in the data is 2015. Remember also that you loaded the shapefile of municipalities in Austria in a previous exercise. So, let’s formulate the following goal: we want to have a spatial dataset (i.e. including geometry) of all municipalities in the province of Salzburg, containing a variable that describes the total population (male plus female) of each municipality.
First, we filter the observations of the population data such that we only keep the most recent year. Then, we select only the variables of interest: CITY_CODE, the unique id of a municipality; SEX, describing whether the observation regards males or females; and POP_TOTAL, the total population observed.
population_sbg_2015 = population_sbg %>%
filter(REF_YEAR == 2015) %>%
select(CITY_CODE, SEX, POP_TOTAL)
head(population_sbg_2015)
## CITY_CODE SEX POP_TOTAL
## 1 50101 1 70385
## 2 50101 2 78035
## 3 50201 1 2897
## 4 50201 2 2904
## 5 50202 1 1763
## 6 50202 2 1726
Now, we still have two rows per municipality (i.e. one for male, one for female), which we want to add up. That is: we group the data by municipality (column CITY_CODE), and then, for each group, summarise the POP_TOTAL column by calculating the sum of its values.
total_population_sbg_2015 = population_sbg_2015 %>%
group_by(CITY_CODE) %>%
summarise(population = sum(POP_TOTAL))
head(total_population_sbg_2015)
## # A tibble: 6 x 2
## CITY_CODE population
## <int> <int>
## 1 50101 148420
## 2 50201 5801
## 3 50202 3489
## 4 50203 2236
## 5 50204 4221
## 6 50205 20607
Great! That looks useful. Now we have to somehow add the geometries of the municipalities. These are stored in the municipalities dataset, where the equivalent of the CITY_CODE column is called id. However, there is one last technical issue to solve before we can join the two. We are dealing with governmental data, so of course, it is messy. The data type of CITY_CODE in total_population_sbg_2015 is not equal to the data type of id in municipalities. You do not have to worry here about understanding data types in detail, but you do have to know that we have to make them identical before joining. Therefore, we will create a new column in total_population_sbg_2015 which overwrites the old CITY_CODE, keeping the same values, but changing the data type.
total_population_sbg_2015 = total_population_sbg_2015 %>%
mutate(CITY_CODE = as.factor(CITY_CODE))
Finally, time to join! The most used joins are the left_join and the inner_join.
type | description |
---|---|
inner_join | return all rows from x where there are matching values in y, and all columns from x and y. |
left_join | return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA (i.e. ‘not available’) values in the new columns. |
(Source: dplyr documentation)
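A toy sketch of the difference, with two made-up data frames (assuming dplyr is loaded):

```r
library(dplyr)

x <- data.frame(id = c(1, 2), value_x = c("a", "b"))
y <- data.frame(id = c(2, 3), value_y = c("c", "d"))

# inner_join: only id 2 has a match in both, so one row remains.
inner_join(x, y, by = "id")

# left_join: all rows of x are kept; id 1 gets NA for value_y.
left_join(x, y, by = "id")
```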
For our case, we want to keep only the municipalities in Salzburg from the Austria-wide municipalities dataset, and add the population information to them. That is, we want to return all rows from municipalities where there are matching values in total_population_sbg_2015. So that is an… inner_join!
Note: A left_join(total_population_sbg_2015, municipalities) would in this case have given the same result (think about why!), but it would have kept the type of the population dataset, which was a data.frame, and not a spatial sf object. Hence, you would have had to convert the result to an sf object afterwards.
municipalities_sbg = municipalities %>%
inner_join(total_population_sbg_2015, by = c('id' = 'CITY_CODE'))
head(municipalities_sbg)
## Simple feature collection with 6 features and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 374015 ymin: 395438.6 xmax: 413827.4 ymax: 439501.5
## epsg (SRID): 31287
## proj4string: +proj=lcc +lat_1=48.99999999999999 +lat_2=46 +lat_0=47.5 +lon_0=13.33333333333334 +x_0=400000 +y_0=400000 +ellps=bessel +towgs84=601.705,84.263,485.227,4.7354,-1.3145,-5.393,-2.3887 +units=m +no_defs
## id name population geometry
## 1 50101 Salzburg 148420 MULTIPOLYGON (((380697.9 43...
## 2 50201 Abtenau 5801 MULTIPOLYGON (((406462.1 40...
## 3 50202 Adnet 3489 MULTIPOLYGON (((386576.8 41...
## 4 50203 Annaberg-Lungötz 2236 MULTIPOLYGON (((406416.4 40...
## 5 50204 Golling an der Salzach 4221 MULTIPOLYGON (((387546.6 40...
## 6 50205 Hallein 20607 MULTIPOLYGON (((384067.8 41...
mapview(municipalities_sbg)
This seems to have worked very well! Now that you know how to use the most important dplyr functions, it is time to move them to the spatial domain. Because after all, the join we just did was based on an id that was present in both datasets, not on any explicit spatial properties.
sf is full of functions to wrangle spatial data. You may have already noticed that most of these functions start with st_. Again, if you are a PostGIS user, this will look familiar. Not a coincidence: remember that sf is largely based on official standards, which also define valid geometrical operations. Some examples:
st_transform() will transform the data into a different CRS:
# Castles to EPSG:3035
castles %>%
st_transform(crs = 3035) %>%
st_crs()
## Coordinate Reference System:
## EPSG: 3035
## proj4string: "+proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
# You can also use the CRS of another dataset.
castles %>%
st_transform(crs = st_crs(municipalities)) %>%
st_crs()
## Coordinate Reference System:
## EPSG: 31287
## proj4string: "+proj=lcc +lat_1=48.99999999999999 +lat_2=46 +lat_0=47.5 +lon_0=13.33333333333334 +x_0=400000 +y_0=400000 +ellps=bessel +towgs84=601.705,84.263,485.227,4.7354,-1.3145,-5.393,-2.3887 +units=m +no_defs"
st_area() will calculate the area of a geometry:
# Calculate the area of the municipality of Salzburg city.
municipalities_sbg %>%
filter(name == "Salzburg") %>%
st_area()
## 65641400 [m^2]
st_buffer() draws a buffer around a geometry:
# Draw a buffer of 500m around the Festung Hohensalzburg.
castles %>%
filter(Name == "Festung Hohensalzburg") %>%
st_buffer(500) %>%
mapview()
st_union() merges several geometries into one:
# Merge all municipalities in the Salzburg province.
municipalities_sbg %>%
st_union() %>%
mapview()
Hopefully you have noticed by now that sf functions (and also mapview) work really well together with dplyr and pipes. Some dplyr functions work slightly differently on sf objects, because of the sticky geometry that we talked about before. For example, selecting a variable will always keep the geometry along.
# Here, we select only Name, but the geometry sticks with it.
castles %>%
select(Name)
## Simple feature collection with 104 features and 1 field
## geometry type: POINT
## dimension: XY
## bbox: xmin: 369853.9 ymin: 215126.1 xmax: 488768.8 ymax: 314745.1
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=0 +lon_0=13.33333333333333 +k=1 +x_0=450000 +y_0=-5000000 +ellps=bessel +units=m +no_defs
## First 10 features:
## Name geometry
## 1 Schloss Anif POINT (430134.6 289692.9)
## 2 Schloss Blühnbach POINT (432843 260199.2)
## 3 Schloss Aigen POINT (431729 294215.6)
## 4 Schloss Arenberg POINT (429486.6 295754.9)
## 5 Edmundsburg POINT (428227.9 295454.4)
## 6 Schloss Emsburg POINT (429678.7 292559.1)
## 7 Schloss Frohnburg POINT (429652.7 293224.2)
## 8 Schloss Freisaal POINT (429378.2 294304.9)
## 9 Schloss Hellbrunn POINT (429582.8 291547)
## 10 Schloss Hellbrunn - Monatsschlössl POINT (429860.5 291272.8)
Summarising a variable will also summarise the geometry. That is, when you summarise a variable for a set of points, the POINT geometry of all these points will be merged together into a single MULTIPOINT geometry. See it for yourself:
# Count all the castles in the data. Counting is done with the function n().
castles %>%
summarise(total_count = n())
## Simple feature collection with 1 feature and 1 field
## geometry type: MULTIPOINT
## dimension: XY
## bbox: xmin: 369853.9 ymin: 215126.1 xmax: 488768.8 ymax: 314745.1
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=0 +lon_0=13.33333333333333 +k=1 +x_0=450000 +y_0=-5000000 +ellps=bessel +units=m +no_defs
## total_count geometry
## 1 104 MULTIPOINT (369853.9 235736...
Here, you are probably not so interested in this MULTIPOINT geometry. No worries, you can easily get rid of the geometry by setting it to NULL. Of course, the data will then lose its spatial properties.
castles %>%
summarise(total_count = n()) %>%
st_set_geometry(NULL)
## total_count
## 1 104
Finally, also the join functionality has a spatial counterpart: st_join(). To understand it, we first have to introduce something we call spatial predicates. These are functions that accept two geometries, and apply a test to them that returns either TRUE or FALSE. Too technical? Let’s just look at an example.
The spatial predicate st_within() will tell you if geometry a is fully inside geometry b: is the Festung Hohensalzburg inside the municipality of Salzburg?
# Don't forget to put them in exactly the same CRS!
festung = castles %>% filter(Name == "Festung Hohensalzburg") %>% st_transform(st_crs(municipalities))
sbg_city = municipalities_sbg %>% filter(name == "Salzburg")
st_within(festung, sbg_city)
## Sparse geometry binary predicate list of length 1, where the predicate was `within'
## 1: 1
Answer: yes! (Hint: 1 stands for TRUE.) See here a description of all available spatial predicates.
Now what does this have to do with joins? We can use these spatial predicates to spatially join information from two spatial datasets! An example: we want to join festung with municipalities_sbg such that festung will include information about the municipality in which it is located.
festung %>%
st_join(municipalities_sbg, join = st_within)
## Simple feature collection with 1 feature and 12 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 378648 ymin: 432897.7 xmax: 378648 ymax: 432897.7
## epsg (SRID): 31287
## proj4string: +proj=lcc +lat_1=48.99999999999999 +lat_2=46 +lat_0=47.5 +lon_0=13.33333333333334 +x_0=400000 +y_0=400000 +ellps=bessel +towgs84=601.705,84.263,485.227,4.7354,-1.3145,-5.393,-2.3887 +units=m +no_defs
## Name Adresse HNR Ort PLZ Gemeinde GEMNR
## 1 Festung Hohensalzburg Mönchsberg 34 Salzburg 5020 Salzburg 50101
## DBLink PDFLink
## 1 http://www.salzburg.gv.at/themen/ks/kultur/burgen/festung.htm <NA>
## id name population geometry
## 1 50101 Salzburg 148420 POINT (378648 432897.7)
NOTE: By default, st_join() does a left join. When you add left = FALSE, it will do an inner join.
So, that was a lot of reading… Time to do something practical. Oh, and also for sf there is a cheatsheet.
Create a new dataset attractions_sbg, which is a copy of municipalities_sbg, but with a few extra columns:

- castles, describing the number of castles in that municipality
- museums, describing the number of museums in that municipality
- attractions, being the sum of the number of castles and museums
- per_sqm, being the number of attractions per square meter
- per_person, being the number of attractions per inhabitant

HINT: with mutate(columnname = replace_na(columnname, 0)) you replace all NA values in a column with 0. You will need it!
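To see what the hint does, here is a minimal sketch with a made-up column (replace_na() comes from the tidyr package, which is loaded with the tidyverse):

```r
library(dplyr)
library(tidyr)

df <- data.frame(castles = c(2, NA, 1))

# Replace all NA values in the castles column with 0.
df %>%
  mutate(castles = replace_na(castles, 0))
```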
Are you already experienced in R, and is this much too easy for you? Then also answer these questions.
Visualizing data is a very important part of spatial data analysis. And that includes not only maps! For example, we can compare the area and population of municipalities in a scatterplot. The easiest way is a simple base R plot:
# This is base R plotting. A simple scatterplot can be created with one line.
plot(area ~ population, data = attractions_sbg)
Very easy, but you will probably not impress your friends. ggplot2 is a tidyverse package that makes plotting fancier. It works with layers, which you add together with +. In our example, we first initialize the plot (Which dataset? Which variables on the x and y axes?), and then add scatter points as a layer.
NOTE: You may have noticed that the area column has a unit associated with it. This is done by the units package, which is imported by sf. Unfortunately, ggplot2 is not fancy enough to deal with this, so we’ll have to drop the units first.
# A more 'tidy' way to plot, with very nice results is ggplot2.
scatter = attractions_sbg %>%
mutate(area = units::drop_units(area)) %>%
ggplot(aes(x = population, y = area)) +
geom_point()
scatter
As you see, ggplot2 works very well in combination with pipes. Also, you noticed of course that there is one point that makes our plot a little ugly (which municipality would that be…). You can remove this point, but the nice thing about ggplot2 is that you can also provide your plot to the plotly library, which will magically turn it into an interactive plot in just one line of code! Now you can zoom in to any area of your plot.
library(plotly)
ggplotly(scatter)
Back to mapping. We already learned how to make fast interactive maps with mapview. Static thematic maps can be made in ggplot2 (there is the geom_sf() layer for this; try it if you feel adventurous!), but tmap is a package that is even better suited for this task. Just as ggplot2, tmap works with different layers. First, you create a shape, then you decide how to visualize that shape. You can add several shapes on top of each other if you want! We cheat a little bit with the legend breaks, such that this one annoying municipality does not bother us once more.
library(tmap)
map = tm_shape(attractions_sbg) +
tm_polygons(col = 'population', style = 'jenks', palette = 'viridis')
map
Run tmap_mode('view'), and then create the same thematic map as in the example. What happens?

One thing we did not cover yet is how to export your data. This is made simple with st_write(). Let’s export our result as a GeoJSON.
. Let’s export our result as a GeoJSON.
# When the file already exists, force an overwrite.
st_write(attractions_sbg, 'result.geojson', delete_dsn = TRUE)
## Deleting source `result.geojson' using driver `GeoJSON'
## Writing layer `result' to data source `result.geojson' using driver `GeoJSON'
## Writing 119 features with 9 fields and geometry type Multi Polygon.
Feeling fancy? Open the file app.Rmd (which lives in the GitHub repository) and click on Run Document. Yes, you can also render an R Markdown file into a web app! With not too much code we created a small interactive web app of the data. Nothing too fancy, but not too hard and scary either! If you still have time and feel adventurous, try to pimp the app. See here for some documentation.
We only covered a small part of all the things that are possible in the spatial R world. If you are interested and want to know more, you are lucky. There is an incredible amount of free resources online (almost all of them created with R Markdown, by the way). Here are some links that might be useful: