HOBO QC and hourly series

Exercise 2: Data quality control and hourly series.

Michael Stölzle mailto:michael.stoelzle@hydro.uni-freiburg.de (University of Freiburg)
2020-01-13

Table of Contents


Details
Data Edited raw data from Exercise #1
Challanges Create a flagging system to ensure data quality

Aims

The aims of this exercise are

At the end of this exercise you should have a quality-controlled hourly temperature series for the period 2019-12-14 until 2020-01-06. It is not a problem if your series is shorter (i.e. missing days at the begin or end of the series), but be sure that you only include full days (i.e. each day has 24 values). You can generate a corresponding date/time vector easily with lubridate, e.g.:


library(lubridate)
seq(ymd_hm('2018-12-10 00:00'),ymd_hm('2019-01-05 23:50'), by = '10 mins')
seq(ymd_hm('2019-12-14 00:00'),ymd_hm('2020-01-06 23:00'), by = '1 hour')

Report: The research questions here are: “How good was your temperature measurement over the 24 days and why have been certain periods flagged as bad or low-quality data?” and “Are the regression models between your HOBO station and the reference stations different and what are reasons for the differences?”. Describe how you develop the quality flagging system and present the results, i.e. how many data points are flagged by the different quality control tests? Present the outcomes of the regression models and state shortly how you have implemented them into R. Compare the regression models and give some reasons why you decided at the end for one specific regression model (i.e. one reference station) to fill the gaps in your HOBO series that result from quality control. Review the effect of the gap filling on your new hourly time series, i.e. is the model useful/appropriate to fill the gaps of your series? If possible, investigate the regression models more detailed, e.g. quality of fitting for day/night, colder/warmer periods and look on the residuals of the models and discuss outliers (if there are any). Dont forget to use the meta information of your HOBO location during discussion of the temperature series (e.g. exposition). Additionally is might be valuable to consider also the light intensity series to explain the variability in temperture.

1. Quality control

Use your edited HOBO file from Exercise #1. During quality control check means to check if a certain condition is fulfilled or not, flag means a systematic flagging of bad data points when te check fails.

1.1 Measurement range (Plausible values)

Check each temperature data point to be in the measurement range. See the HOBO manual for specification.

1.2 Plausible rate of change

Check each temperature data point to have not more than 1K temperature change compared to the previous data point. Check out the lead() and lag() functions in the dplyr-package and try to use them in a mutate-command.

1.3 Minimum variability (Persistence)

If temperature has not changed during the last 60 minutes (i.e. data point \(T_i\) plus 5 data points before from \(T_{i-1}\) to \(T_{i-5}\)) flag the corresponding data point \(T_i\) as bad data point.

1.4 Light intensity

The last quality check uses measured light intensity to flag bad data points.

Description Values or range [lux]
Full moon or clear night 0.1-1.0
Sunrise or sunset 50 - 500
Typical overcast day at midday 1.000-2.000
Clear sky luminance 25.000
Full daylight without direct sun 10.000 - 25.000
Shade illuminated by entire clear blue sky, midday 20.000
Bright sunshine 110.000
Typical range of illumination by sun light 32.000 - 130.000

Define two threshold \(L_1\) [lux] and \(L_2\) [lux] with \(L_2\) > \(L_1\):

  1. Flag a temperature value \(T_i\) (and also \(T_{i-1}\) and \(T_{i+1}\)) if measured light intensity is higher than \(L_1\) (i.e. 3 values in total are flagged)
  2. Flag a temperature value \(T_i\) (and also \(T_{i-3}\) - \(T_{i+3}\)) if measured light intensity is higher than \(L_2\) (i.e. 7 values in total are flagged)

Note: To ensure that you only filter out data points influenced by sunlight/radiation this check on light intensity should only excuted during day-time (6am - 6pm). One possibility: Run the test on all data and then overwrite flasely flagged data point in the night. If your maximum light intensity is really low (e.g. due to a really shaded/dark location) you have to adjust \(L_1\) and \(L_2\) to low(er) values. One possibility is to use the 95th and 99th percentile of your light intensity data as thresholds. Then, investigate the temperature changes when light intesity is above these thresholds. If you consider these temperature data as influenced than flag the data points. You can compare suspicoius periods with the same periods on other days.

Hints: Compare relationships between light intensity and temperature across different HOBOs. Look at the distribution of measured light intensity and compare the highest x % with the classes in the table. Look at your temperature changes during high light intensity. Consider your HOBO exposition and compare light intensity of warm days (e.g. above average) with cold(er) days. Focus on hours when high(er) light intensity could be expected. Sort your data_frame with descending Lux values (i.e. %>% arrange(...)) and investigate the temperature values (and step changes…) for the Top-10, Top-50, Top-100 Lux values.

It might be helpful to use rollaplly from zoo-package with a specific width \(w\) and a specific setting for align = “…” and fill=“…”, see the following code snippet. \(w\) can be set to the total number of points that should be flagged (before and after the actual data point).


df %>% mutate(lux_th1 = if_else(lux > 30.000, 1, 0)) %>% 
             mutate(lux_th2 = rollapply(lux_th1, w, FUN = sum,...))

Check the sums in lux_th2 and think about hjow to use this information to flag a data point or not. Same procedure can be done for the other light intensity threshold.

Read carefully the help of the functionrollapply, width can also be used with a list not only with a vector (see Details there).

2. Flagging system to identify bad data:

Ideally you have now a data_frame with the measured data and separated columns for each quality check. If a data point fails at least one check the data point is flagged. If one hour of data has one or no flag you can aggregate the 10-minute values to hourly averages. If a data point has two or more flags the hour is considered as erroneous and the data point in the generated hourly temperature series is set to NA. NA-values have to be filled with the regression model to gain a hourly temperature series without NA’s.

Example of a data quality check flagging system with different quality checks in qc1, qc2, qc3. Use the flags in a systematic way, i.e. 0=ok, 1=fails. You can also use boolean values (TRUE,FALSE) and calculate the sums of the qc-columns in a qc_total-column.

dttm ta lux qc1 qc2 qc3 qc_total
2018-12-24 12:00 3.7 500 0 0 1 1
2018-12-24 12:10 3.8 4000 0 1 0 1
2018-12-24 12:20 3.9 20500 1 0 1 2
2018-12-24 12:30 5.7 110500 1 1 1 3
2018-12-24 12:40 4.2 22500 0 1 1 2
2018-12-24 12:50 4.1 32500 0 0 1 1

Small statistical analysis: Calculate how often your different quality checks flag the data and discuss the outcome of the different checks briefly in the exercise text.

3. Filling gaps with regression model

Aim is to fill the gaps in your HOBO series with data from reference series. NA-values originate from a) missing measurements and/or b) low quality data. At the end you must have an hourly temperature series without any gaps/missing values.

Helpful functions to generate hourly series:

Use group_by to generate unique groups in your tibble.

3.1 Reference stations

Download hourly air temperature data from the three reference stations and give the distances to your station in your exercise text. Data should span exactly the same period as your series and must have the same temporal resolution (hour):

WBI: http://www.wetter-bw.de DWD: ftp://ftp-cdc.dwd.de/pub/CDC/ or https://www.dwd.de/EN/climate_environment/cdc/cdc_node.html

Data from DWD Station #13667 is located in the Urban Climate folder on the DWD FTP server. Depending on your computer, browser and OS the FTP server directories will show up in the browser and/or Finder/Explorer. With Google Chrome a web-view oft the FTP directory is possible.

Locations of reference stations in Freiburg.
Locations of reference stations in Freiburg.

3.2 Regression model

Load all reference series into R into a data_frame together with your HOBO temperature series and check for temporal consistency. Aim is to find the best regression model (i.e best fitting reference station)

Build three linear regression models lm(y~x) with your hourly temperature series (HOBO) and the corresponding WBI/DWD data. With the parameters intercept \(a\) and slope \(b\) you should predict() the missing temperature data points in the HOBO series (\(y\)) with corresponding data from the reference stations (WBI/DWD) (\(x\)).

\[ y = a + bx \]

Extract the coefficient of determination (\(R^2\)) and analyze the model residuals visually for the regression models. Discuss the differences (i.e. over- versus underestimation, \(R^2\), outliers, distribution of residuals, distance from your location to reference stations) and argue which of the three regression models is used to fill your data gaps in the HOBO series. Read ?residuals in the R help. Suggestion: Insert a scatterplot showing the relationship between your HOBO data and the reference data and the regression models.

My \(R^2\) is relatively low and the relationship between my HOBO data and the reference date looks weak - what should I do?

Note: For Exercise #3 the amount of hours with NA is needed.

4. Upload hourly series

After the quality checks and the gap filling with reference stations upload your hourly HOBO time series with the name hoboid_Th.tsv (e.g. 10350099_Th.tsv) to the Github data repository.

Upload here: https://github.com/data-hydenv/data/tree/master/hobo/2020/_02hourly

The file must contain four columns:

  1. Date in the format “YYYY-MM-DD”
  2. Hour of day (0-23) in the format “HH”
  3. Temperature in °C rounded on 3 digits
  4. A flag indicating wether the value is from your HOBO station (H) or filled by regression (R)
  5. The header (from Exercise 01) with meta information (e.g. location, altitude) is not needed in this file.

Column seperator must be a tab/tabulator. Column names see below:

date hour th origin
2018-12-24 15 2.388 H
2018-12-24 16 2.428 H
2018-12-24 17 2.101 R
2018-12-24 18 2.275 H

First data point is “2018-12-14 00”, last data point is “2020-01-06 23”, in total 576 data points/hours (24 days).


library(lubridate)
seq(ymd_hm('2018-12-14 00:00'),ymd_hm('2019-01-06 23:00'), by = 'hour')