
Prepare eBird Status and Trends data for BirdFlow model fitting
Source: R/preprocess_species.R
Write a template BirdFlow object to an hdf5 file based on distribution data
downloaded with ebirdst. The object is complete except for marginals
and transitions. Use ... to truncate the model to part of the year.
Usage
preprocess_species(
species = NULL,
out_dir = NULL,
res = NULL,
hdf5 = TRUE,
overwrite = TRUE,
crs = NULL,
clip = NULL,
max_params = NULL,
gpu_ram = 12,
skip_quality_checks = FALSE,
min_season_quality = 3,
trim_quantile = NULL,
...
)
Arguments
- species
A species in any format accepted by ebirdst::get_species().
- out_dir
Output directory; files will be written here. Required unless hdf5 is FALSE. File names created here will incorporate the species code, resolution, and eBird version year.
- res
The target resolution of the BirdFlow model in kilometers. If res is NULL (default), a resolution that results in fewer than max_params parameters will be used, while also minimizing the resolution and limiting the number of significant digits.
- hdf5
If TRUE (default) an hdf5 file will be exported.
- overwrite
If TRUE (default) any pre-existing output files will be overwritten. If FALSE, pre-existing files will result in an error.
- crs
Coordinate reference system (CRS) to use. Defaults to the custom projection eBird has assigned to this species (see ebirdst::load_fac_map_parameters()). It will be interpreted by terra::crs() to generate a well known text representation of the CRS.
- clip
A polygon or the path to a file containing a polygon. It must have a CRS and should either be a SpatVector object or produce one when called with vect(clip).
- max_params
The maximum number of fitted parameters that the BirdFlow model should contain. Ignored if res is not NULL. Otherwise a resolution will be chosen that yields no more than this many fitted parameters. See gpu_ram for the default way of setting max_params and res. Note: the reduction in parameters resulting from truncation (see ...) is not factored into the calculation.
- gpu_ram
Gigabytes of RAM on the GPU machine that will fit the models. If res is NULL and max_params is NULL, this is used to estimate max_params which is, in turn, used to determine the resolution. Ignored if either res or max_params is set.
- skip_quality_checks
If TRUE then preprocess the species even if not all four ranges are modeled (under ebirdst 2021 version year) or, for 2022 and subsequent data versions, if not all <season>_quality values in ebirdst_runs are higher than min_season_quality.
- min_season_quality
The minimum acceptable season quality when preprocessing eBird 2022 and subsequent versions. Model quality is checked against the four <season>_model_quality columns in ebirdst_runs. Ignored with the 2021 ebirdst version year.
- trim_quantile
With the default of NULL there is no outlier trimming. Otherwise, supply a single value between 0 and 1 to indicate the quantile to truncate at, or a series of 52 such values, one for each week. Trimming outliers is always done week by week, with the values above the trim_quantile quantile set to the value of that quantile. Reasonable non-NULL values will be close to 1, e.g. 0.99, 0.995, 0.999. Set trim_quantile to eliminate high outliers that are believed to be model artifacts. See Issue #189 for detailed justification.
- ...
Arguments passed on to lookup_timestep_sequence
  - season
  A season name, season alias, or "all". See lookup_season_timesteps() for options.
  - start
  The starting point in time specified as a timestep, character date, or date object.
  - end
  The ending point in time as a date or timestep.
  - direction
  Either "forward" or "backward"; defaults to "forward" if not processing dates. If using date input, direction is optional and is only used to verify the direction implicit in the dates.
  - season_buffer
  Only used with season input. season_buffer is passed to lookup_season_timesteps() and defaults to 1; it is the number of timesteps to extend the season by at each end.
  - n_steps
  Alternative to end. The end will be n_steps away from start in direction, and the resulting sequence will have n_steps transitions and n_steps + 1 timesteps.
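The week-by-week trimming described for trim_quantile can be illustrated with a small base R sketch. The helper name trim_weekly and the toy matrix are hypothetical, and the sketch assumes the quantile is taken over all cells in a week; it is not the package's internal code, only one way to express "values above the quantile are set to the value of that quantile".

```r
# Hypothetical helper: cap each week's values at that week's quantile.
# distr: numeric matrix, one row per cell, one column per week.
# trim_quantile: a single probability, or one probability per week.
trim_weekly <- function(distr, trim_quantile) {
  probs <- rep_len(trim_quantile, ncol(distr))
  for (week in seq_len(ncol(distr))) {
    cap <- stats::quantile(distr[, week], probs = probs[week], names = FALSE)
    distr[, week] <- pmin(distr[, week], cap)
  }
  distr
}

# Toy data: 1000 cells by 3 weeks, with one artificial outlier in week 2.
set.seed(1)
distr <- matrix(runif(3000), ncol = 3)
distr[1, 2] <- 1000
trimmed <- trim_weekly(distr, 0.99)
max(trimmed[, 2]) < 1000  # the outlier has been pulled down to the cap
```

Values at or below the cap are left unchanged, so with a quantile close to 1 only the extreme upper tail of each week is affected.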
Value
Returns a BirdFlow model object that lacks marginals, but is otherwise complete. It is suitable for fitting with BirdFlowPy.
Maximum number of parameters
The maximum number of parameters that can be fit is machine dependent. On 2023-02-10 we tested at different resolutions with "amewoo" and identified bounds on the maximum.
| Machine | GPU RAM (GB) | Lower Bound (worked) | Upper Bound (failed) | Parameters / GB |
| --- | --- | --- | --- | --- |
| titanx GPU | 12 | 306804561 | 334693725 | 25567047 |
| m40 GPU | 24 | 557395226 | 610352178 | 23224801 |
The number of parameters is the number of unmasked cells for the first timestep plus the total number of cells in the marginals, which is calculated from the dynamic mask.
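As a rough illustration of that count, the sketch below uses a toy dynamic mask. It assumes that each marginal joins two adjacent timesteps and holds one cell for every pairing of an active cell at the first timestep with an active cell at the second; the object names are hypothetical and the toy model is not cyclical.

```r
# Toy dynamic mask: rows are cells, columns are timesteps;
# TRUE means the cell is included in the model at that timestep.
dyn_mask <- cbind(
  t1 = c(TRUE, TRUE, FALSE, FALSE),
  t2 = c(TRUE, TRUE, TRUE, FALSE),
  t3 = c(FALSE, TRUE, TRUE, TRUE)
)
n_active <- colSums(dyn_mask)  # 2, 3, 3

# Assumption: a marginal between timesteps t and t + 1 has
# n_active[t] * n_active[t + 1] cells.
# The transitions here are t1 -> t2 and t2 -> t3.
marginal_cells <- n_active[-length(n_active)] * n_active[-1]  # 6, 9

n_params <- unname(n_active[1] + sum(marginal_cells))
n_params  # 2 + 6 + 9 = 17
```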
If gpu_ram is used (and not res or max_params) then
max_params is set to 23,224,801 * gpu_ram (the lower of the two
Parameters / GB values in the table above).
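The arithmetic can be reproduced from the table values. This is a sketch of the stated rule only; the data frame and variable names are illustrative.

```r
# Lower bounds (runs that worked) from the table above.
bounds <- data.frame(
  machine = c("titanx", "m40"),
  gpu_ram_gb = c(12, 24),
  lower_bound = c(306804561, 557395226)
)
bounds$params_per_gb <- round(bounds$lower_bound / bounds$gpu_ram_gb)

# Use the more conservative (lower) of the two ratios.
params_per_gb <- min(bounds$params_per_gb)  # 23224801

gpu_ram <- 12  # default value of the gpu_ram argument
max_params <- params_per_gb * gpu_ram
max_params  # 278697612
```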
The heuristic to determine resolution given a maximum number of parameters
must estimate the number of cells covered by the data
at a different resolution, a noisy process, so it iteratively tries to find
the smallest resolution that doesn't exceed max_params and then rounds to
a slightly larger resolution (fewer parameters).
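The general shape of such a search can be sketched as follows. This is not the package's algorithm: the real estimate depends on the species' data and is noisy, whereas this sketch swaps in a hypothetical, deterministic estimator (estimate_params) based on a fixed range area, and uses plain bisection. All function names and constants below are assumptions made for illustration.

```r
# Hypothetical estimator: parameters for a range of fixed area (km^2),
# assuming every timestep has the same number of active cells.
estimate_params <- function(res, area_km2 = 4e6, n_transitions = 52) {
  n_cells <- area_km2 / res^2
  n_cells + n_transitions * n_cells^2
}

# Bisection for the smallest cell size (km) that stays within max_params.
# The estimate decreases as res grows, so the search is monotonic.
find_res <- function(max_params, lower = 1, upper = 1000, tol = 0.01) {
  stopifnot(estimate_params(upper) <= max_params)
  while (upper - lower > tol) {
    mid <- (lower + upper) / 2
    if (estimate_params(mid) <= max_params) upper <- mid else lower <- mid
  }
  upper
}

# Round up to a limited number of significant digits so the final
# resolution is slightly coarser (fewer parameters) than the exact fit.
round_up_signif <- function(x, digits = 2) {
  magnitude <- 10^(floor(log10(x)) - digits + 1)
  ceiling(x / magnitude) * magnitude
}

max_params <- 23224801 * 12
res <- round_up_signif(find_res(max_params))
estimate_params(res) <= max_params  # TRUE
```

Rounding up rather than to the nearest value keeps the result on the safe side of max_params, which matches the description above of settling on a slightly larger resolution.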
Examples
if (FALSE) { # \dontrun{
bf <- preprocess_species("amewoo", hdf5 = FALSE)
plot_distr(get_distr(bf, c(1, 26)), bf = bf)
# Create clip polygon as an sf object
# Use the extent rectangle but with western edge moved in
# The clip can be anything that terra::vect will process into a polygon
e <- ext(bf)
e[1] <- -1500000
coords <- matrix(c(e[1], e[3],
                   e[1], e[4],
                   e[2], e[4],
                   e[2], e[3],
                   e[1], e[3]), ncol = 2, byrow = TRUE)
sfc <- sf::st_sfc(sf::st_polygon(list(coords)), crs = crs(bf))
clip <- sf::st_sf(data.frame(id = 1, geom = sfc))
bfc <- preprocess_species("amewoo", hdf5 = FALSE, clip = clip) # with clip
plot_distr(get_distr(bfc, 1), bfc)
} # }