Exercise 2: Data quality control and hourly series.
Details | |
---|---|
Data |
Edited raw data from Exercise #1 |
Challanges |
Create a flagging system to ensure data quality |
The aims of this exercise are
data_frame
in R
and,At the end of this exercise you should have a quality-controlled hourly temperature series for the period 2019-12-14
until 2020-01-06
. It is not a problem if your series is shorter (i.e. missing days at the begin or end of the series), but be sure that you only include full days (i.e. each day has 24 values). You can generate a corresponding date/time vector easily with lubridate
, e.g.:
library(lubridate)
seq(ymd_hm('2018-12-10 00:00'),ymd_hm('2019-01-05 23:50'), by = '10 mins')
seq(ymd_hm('2019-12-14 00:00'),ymd_hm('2020-01-06 23:00'), by = '1 hour')
Report: The research questions here are: “How good was your temperature measurement over the 24 days and why have been certain periods flagged as bad or low-quality data?” and “Are the regression models between your HOBO station and the reference stations different and what are reasons for the differences?”. Describe how you develop the quality flagging system and present the results, i.e. how many data points are flagged by the different quality control tests? Present the outcomes of the regression models and state shortly how you have implemented them into R. Compare the regression models and give some reasons why you decided at the end for one specific regression model (i.e. one reference station) to fill the gaps in your HOBO series that result from quality control. Review the effect of the gap filling on your new hourly time series, i.e. is the model useful/appropriate to fill the gaps of your series? If possible, investigate the regression models more detailed, e.g. quality of fitting for day/night, colder/warmer periods and look on the residuals of the models and discuss outliers (if there are any). Dont forget to use the meta information of your HOBO location during discussion of the temperature series (e.g. exposition). Additionally is might be valuable to consider also the light intensity series to explain the variability in temperture.
Use your edited HOBO file from Exercise #1. During quality control check means to check if a certain condition is fulfilled or not, flag means a systematic flagging of bad data points when te check fails.
Check each temperature data point to be in the measurement range. See the HOBO manual for specification.
Check each temperature data point to have not more than 1K temperature change compared to the previous data point. Check out the lead()
and lag()
functions in the dplyr
-package and try to use them in a mutate
-command.
If temperature has not changed during the last 60 minutes (i.e. data point \(T_i\) plus 5 data points before from \(T_{i-1}\) to \(T_{i-5}\)) flag the corresponding data point \(T_i\) as bad data point.
The last quality check uses measured light intensity to flag bad data points.
Description | Values or range [lux] |
---|---|
Full moon or clear night | 0.1-1.0 |
Sunrise or sunset | 50 - 500 |
Typical overcast day at midday | 1.000-2.000 |
Clear sky luminance | 25.000 |
Full daylight without direct sun | 10.000 - 25.000 |
Shade illuminated by entire clear blue sky, midday | 20.000 |
Bright sunshine | 110.000 |
Typical range of illumination by sun light | 32.000 - 130.000 |
Define two threshold \(L_1\) [lux] and \(L_2\) [lux] with \(L_2\) > \(L_1\):
Note: To ensure that you only filter out data points influenced by sunlight/radiation this check on light intensity should only excuted during day-time (6am - 6pm). One possibility: Run the test on all data and then overwrite flasely flagged data point in the night. If your maximum light intensity is really low (e.g. due to a really shaded/dark location) you have to adjust \(L_1\) and \(L_2\) to low(er) values. One possibility is to use the 95th and 99th percentile of your light intensity data as thresholds. Then, investigate the temperature changes when light intesity is above these thresholds. If you consider these temperature data as influenced than flag the data points. You can compare suspicoius periods with the same periods on other days.
Hints: Compare relationships between light intensity and temperature across different HOBOs. Look at the distribution of measured light intensity and compare the highest x % with the classes in the table. Look at your temperature changes during high light intensity. Consider your HOBO exposition and compare light intensity of warm days (e.g. above average) with cold(er) days. Focus on hours when high(er) light intensity could be expected. Sort your data_frame with descending Lux values (i.e. %>% arrange(...)
) and investigate the temperature values (and step changes…) for the Top-10, Top-50, Top-100 Lux values.
It might be helpful to use rollaplly
from zoo
-package with a specific width \(w\) and a specific setting for align = “…” and fill=“…”, see the following code snippet. \(w\) can be set to the total number of points that should be flagged (before and after the actual data point).
df %>% mutate(lux_th1 = if_else(lux > 30.000, 1, 0)) %>%
mutate(lux_th2 = rollapply(lux_th1, w, FUN = sum,...))
Check the sums in lux_th2
and think about hjow to use this information to flag a data point or not. Same procedure can be done for the other light intensity threshold.
Read carefully the help of the functionrollapply
, width can also be used with a list not only with a vector (see Details there).
Ideally you have now a data_frame
with the measured data and separated columns for each quality check. If a data point fails at least one check the data point is flagged. If one hour of data has one or no flag you can aggregate the 10-minute values to hourly averages. If a data point has two or more flags the hour is considered as erroneous and the data point in the generated hourly temperature series is set to NA. NA-values have to be filled with the regression model to gain a hourly temperature series without NA’s.
Example of a data quality check flagging system with different quality checks in qc1
, qc2
, qc3
. Use the flags in a systematic way, i.e. 0=ok, 1=fails. You can also use boolean values (TRUE
,FALSE
) and calculate the sums of the qc-columns in a qc_total
-column.
dttm | ta | lux | qc1 | qc2 | qc3 | … | qc_total |
---|---|---|---|---|---|---|---|
2018-12-24 12:00 | 3.7 | 500 | 0 | 0 | 1 | … | 1 |
2018-12-24 12:10 | 3.8 | 4000 | 0 | 1 | 0 | … | 1 |
2018-12-24 12:20 | 3.9 | 20500 | 1 | 0 | 1 | … | 2 |
2018-12-24 12:30 | 5.7 | 110500 | 1 | 1 | 1 | … | 3 |
2018-12-24 12:40 | 4.2 | 22500 | 0 | 1 | 1 | … | 2 |
2018-12-24 12:50 | 4.1 | 32500 | 0 | 0 | 1 | … | 1 |
Small statistical analysis: Calculate how often your different quality checks flag the data and discuss the outcome of the different checks briefly in the exercise text.
Aim is to fill the gaps in your HOBO series with data from reference series. NA
-values originate from a) missing measurements and/or b) low quality data. At the end you must have an hourly temperature series without any gaps/missing values.
Helpful functions to generate hourly series:
date
, hour
(lubridate)mutate
, summarise
, if_else
(dplyr)Use group_by
to generate unique groups in your tibble.
Download hourly air temperature data from the three reference stations and give the distances to your station in your exercise text. Data should span exactly the same period as your series and must have the same temporal resolution (hour):
WBI: http://www.wetter-bw.de DWD: ftp://ftp-cdc.dwd.de/pub/CDC/ or https://www.dwd.de/EN/climate_environment/cdc/cdc_node.html
Data from DWD Station #13667 is located in the Urban Climate folder on the DWD FTP server. Depending on your computer, browser and OS the FTP server directories will show up in the browser and/or Finder/Explorer. With Google Chrome a web-view oft the FTP directory is possible.
Load all reference series into R
into a data_frame
together with your HOBO temperature series and check for temporal consistency. Aim is to find the best regression model (i.e best fitting reference station)
Build three linear regression models lm(y~x)
with your hourly temperature series (HOBO) and the corresponding WBI/DWD data. With the parameters intercept \(a\) and slope \(b\) you should predict()
the missing temperature data points in the HOBO series (\(y\)) with corresponding data from the reference stations (WBI/DWD) (\(x\)).
\[ y = a + bx \]
Extract the coefficient of determination (\(R^2\)) and analyze the model residuals visually for the regression models. Discuss the differences (i.e. over- versus underestimation, \(R^2\), outliers, distribution of residuals, distance from your location to reference stations) and argue which of the three regression models is used to fill your data gaps in the HOBO series. Read ?residuals
in the R
help. Suggestion: Insert a scatterplot showing the relationship between your HOBO data and the reference data and the regression models.
My \(R^2\) is relatively low and the relationship between my HOBO data and the reference date looks weak - what should I do?
Your temperature series might be a very local snapshot of temperature in Freiburg. Distance to reference station might be a reason for weak(er) relationships, but a near distance does not mean that both time series should be identical.
If the \(R^2\) is quite the same for all reference stations, try to plot all temperature series together in on plot to investigate the differences. Think also about the differences in intercept a and slobe b among the regression models.
Very often it is useful to plot HOBO series and the reference series together in one plot to see what is going on. Have you time shifts in there series (due to issues during reading the data in)? Time shifts will of course reduce your \(R^2\).
At the end you have to choose one regression model. Discuss briefly if there is a temporal pattern with periods where the regression model works better (i.e. useful for higher/lower temperature, useful for beginn/end or day/night of your time series, useful for specific periods during measurement).
Note: For Exercise #3 the amount of hours with NA is needed.
After the quality checks and the gap filling with reference stations upload your hourly HOBO time series with the name hoboid_Th.tsv
(e.g. 10350099_Th.tsv
) to the Github data repository.
Upload here: https://github.com/data-hydenv/data/tree/master/hobo/2020/_02hourly
The file must contain four columns:
Column seperator must be a tab/tabulator. Column names see below:
date | hour | th | origin |
---|---|---|---|
2018-12-24 | 15 | 2.388 | H |
2018-12-24 | 16 | 2.428 | H |
2018-12-24 | 17 | 2.101 | R |
2018-12-24 | 18 | 2.275 | H |
First data point is “2018-12-14 00”, last data point is “2020-01-06 23”, in total 576 data points/hours (24 days).
library(lubridate)
seq(ymd_hm('2018-12-14 00:00'),ymd_hm('2019-01-06 23:00'), by = 'hour')