Using geostatistical techniques to map the distribution of tree species from ground inventory data



Rachel Riemann Hershey
USDA Forest Service, Pennsylvania

Martin A. Ramirez
USDA Forest Service, Idaho

David A. Drake
USDA Forest Service, Pennsylvania

Abstract

Some kinds of information on biodiversity can only be collected from on-the-ground surveys, providing information complementary to that gained by remote sensing. But a ground inventory, particularly over large areas, is expensive and labor intensive. Using geostatistical techniques, we can use known values at sample points to estimate the values of unsurveyed areas in between. By exploring characteristics such as the spatial pattern and variation in the sample data collected, additional information about the unsurveyed areas can be spatially modeled and estimated. From these estimates, a modeled map and dataset of the resource can be created, complete with understandable parameters of accuracy and uncertainty. The importance of identifying the objectives, understanding the phenomena, choosing the appropriate interpolation method and assessing the results is discussed, and some of the tools available to address each of these areas are presented. Ecological subregions are suggested as appropriate units for identifying species subpopulations for modeling and interpolation. Tree-species composition in Pennsylvania and New York is used as the example, using data from the U.S. Department of Agriculture national forest inventory. The resulting modeled dataset provides a picture of individual tree-species distribution and relative abundance across the entire region.

Introduction

On-the-ground inventory can be the only way some kinds of information on forest composition can be collected. But a ground inventory is an expensive and intensive process. This presents a formidable problem when forestry management or research is conducted over large areas. An alternative approach is to "map" values by interpolation from sample locations. Geostatistical techniques such as ordinary kriging, indicator kriging, multigaussian kriging, conditional simulation, and a host of spatial data exploratory methods are existing tools that can be used for this purpose. Although much of their use and development has applied in the earth sciences, geostatistical techniques are being increasingly applied wherever such spatial tools are required for understanding and estimating the resource at unsampled locations, and for understanding the spatial characteristics of a phenomenon. Given the interest in knowing where various natural resources occur, and the uncertainty of those estimates, forestry and the analysis of terrestrial vegetation resources is a growing application area for geostatistics.

In this study we were interested in creating a "map" depicting the current distribution of individual tree species over broad areas, such as the entire states of Pennsylvania or New York. We were interested in estimates of where each species occurred, in what densities it occurred relative to the other species in that area, and in the uncertainty associated with those estimates. Of significant interest, too, were the variability and the spatial character of each species' occurrence, and whether this reflected a known natural pattern of that species, or affected our ability to estimate it. And we wanted to capture this information in a form that could be used in conjunction with other datasets and further analyses.

Approach

With each type of kriging or simulation, the assumptions are different, the results are different, and the data types can be different, making each most useful in different situations. Which technique is most appropriate depends heavily on the phenomena being examined, the sample data being used, and the kind(s) of output desired.

Phenomena. How tree species are distributed across the landscape is affected by many factors, including environmental conditions and direct human influence via harvesting and other land-use histories. And each factor may occur at several different scales. Over large areas, large-scale geographic and environmental factors such as latitude, climate, mountain ranges, and other large topographic features all have an effect. In Pennsylvania, for example, sugar maple is found along with the other northern hardwoods predominantly across the northern part of the state, where both the higher latitudes and the higher elevations of the Allegheny mountains and the Allegheny Plateau coincide (Alerich 1993). Heavy use of hemlock for the tanning industry over the last century also has allowed the relative abundance and distribution of the northern hardwoods, including sugar maple, to expand into areas that had been previously dominated by hemlock (Hough and Forbes 1943; Powell and Considine 1982). At a different scale, over much smaller distances, moisture, soil type, local topography, and local disturbances can have a distinct effect on the occurrence of tree species. And all of those factors can frequently occur at spatial scales less than that resolved by the forest inventory plots used in this study. It was expected that the influences of these small-scale factors would be expressed simply as noise in any spatial exploration of the data, but that much of the variation in tree species resulting from the broad-scale factors probably would be picked up by the FIA plot data because of the better matching of scales at which they occur.

Data. The sample data were collected by the Northeastern Forest Experiment Station's Forest Inventory and Analysis (FIA) unit in the 1988-1990 inventory of Pennsylvania. Basal area--the summed cross-sectional area at breast height--was calculated for all live trees 1.0 inches or larger on the plot (Hansen et al. 1992). The data were for individual tree species, by basal area (ba) per acre as a proportion of the total basal area (% ba/acre). The data were accessed from individual tree records in the Forest Service's Eastwide Tree-Level Database and summarized as %ba/acre for each species by plot. In Pennsylvania, there were 5,100 plots. After some initial investigation, nonforested plots and those with total ba/acre equal to zero (due to missing data) were removed, leaving 2,905 plots. The fine-scale forest/nonforest detail was simply not resolvable at this scale of sampling. Thus, removing the nonforest points from the analysis correspondingly removed considerable noise from the detectable spatial structure of each species' distribution. Although there was no fixed spacing, this sampling intensity and selection of forested plots only, results in a distribution of plots averaging 2.5 km apart and spread fairly evenly throughout the state.

Desired output. The study objective was to create, using geostatistical techniques, a dataset of individual tree-species distribution that could be used in conjunction with other datasets and further spatial analyses. This general objective could be broken down into three more specific objectives. First, we were interested in a dataset that for each species provided a picture of the spatial distribution and spatial pattern of that species to add to our understanding of that resource. Second, we were interested in providing datasets for individual species distributions that could be queried and combined to create a dataset of an individual forest cover type specific to a user's definition. Third, we were interested in these methods and dataset(s) as a possible way to make the FIA sample data more easily available in a spatial context. Using sugar maple as the example, these objectives could all be translated into more specific desires as:

Maintaining the variation inherent in the sample data was considered a critical goal, for that variation is essential to reminding us, when we use any modeled dataset, that there are processes occurring in the landscape for which we do not have an accurate and sufficiently detailed model, for which our present source of information (i.e. the sample plots) and/or present knowledge are insufficient. This information on variation can be described in the form of an uncertainty about the estimate at each location, and in the form of retaining the local spatial variation in the sample data. Both forms are reflected in two of the criteria listed previously.

Methods

Data exploration

Exploring the sample data is a critical first step prior to any interpolation. It is important not only for checking the data for possible errors, but also for understanding the characteristics of the sample data that may bias the results. Those characteristics, along with what is desired for output, will affect what interpolation methods are most applicable. Simply plotting the values was a way to check for apparent anomalies and possible errors in the sample data, to check for any visible trends, and to look for different areas of the dataset that appear to behave spatially differently from other areas of the dataset and may be worthy of separate investigation.

A close look at the basic univariate statistics revealed that each species exhibited extremely skewed distributions, frequently with more than 50% of the plots containing less than 1%ba/acre. Although expected from what we know of tree-species distributions, this situation can create a bias in the estimates of some interpolation methods. In this study, the data were transformed to a normal distribution for use in conjunction with the conditional simulation. The normal-score transform used was a nonlinear transform, but it provided a 1-to-1 mapping of values to allow for an accurate back transformation to %ba/acre values (Deutsch and Journel 1992).

The variogram, covariance, and correlogram, three functions for summarizing spatial continuity, each essentially depict the variation between sample data values at increasing distances from each other (Isaaks and Srivastava 1989). If there is recognizable spatial dependence or structure in the variogram, a model can be fitted to that structure. It is this model of how that species' %ba/acre values vary (on average) with distance away from any single point that is used in the kriging and simulation routines. When the data have strong univariate characteristics, that can mask the data's spatial characteristics as depicted in the variogram. In such cases, as was true for every species examined here, the variogram of the normal-scored data often reveals considerably more spatial dependence and structure than that of the raw data. In this study, a variogram was calculated for both the raw sample data and for a normal-score transform of the sample data, using a lag distance of 500 m with no directional component. An indicator variogram also was calculated and modeled for use in the indicator kriging.

To assess how areas of "local" variability change across the region, descriptive statistics of mean and standard deviation were calculated for each 3000 x 3000 m cell using a 15 x 15 km area as the window describing the "local" area. The resulting dataset provided a picture of how variable the data were over small distances and how that variation differed in different areas of the region under study. All species exhibited a proportional effect, with areas of high local means corresponding to areas of high local standard deviation, indicating a lack of stationarity. This situation was not entirely unexpected but posed a potential problem as stationarity is an important assumption of all of the interpolation methods used here. Using normal-scored data largely eliminated this situation.

Choosing appropriate areas/regions

When subpopulations of a species have a significantly different pattern of spatial distribution, treating the populations separately in the interpolation will improve the final estimates. Essentially, by capturing more than one population within a single variogram, the resulting model is frequently appropriate to neither. These subpopulations may be described by geographic area or by some other defining characteristic such as stand age (e.g., hemlock in Pennsylvania) (Riemann Hershey 1996). Evidence from the data exploration, and/or any previous knowledge about the ecology or land-use history of the phenomena, can indicate that there may be several subpopulations of the same species that could be separated. In the case of tree species, where the spatial pattern and distribution of individual tree species is still more influenced by latitude and large topographic features than by human administrative boundaries such as states, it was suspected that ecological subregions might provide a better default area of investigation than states (Fig. 1). These ecological units, part of the Forest Service's national hierarchical framework of ecological units, are classified and mapped based on "associations of those biotic and environmental factors that directly affect or indirectly express energy, moisture, and nutrient gradients which regulate the structure and function of ecosystems. These factors include climate, physiography, water, soils, air, hydrology, and potential natural communities" (McNab and Avers 1994). Several ecosubregions in Pennsylvania and New York were investigated separately and their statistical characteristics compared with each other and with the state summaries. Where there was substantial difference, particularly in the variogram, the subregions were modeled and interpolated separately.

Interpolation methods

Kriging and the related routines of conditional simulation offer a way of creating a relatively continuous surface of estimated values which take advantage of (and can retain in some cases) the local spatial variation inherent in the sample dataset. There are several types of kriging and conditional simulation. Each does something different with the data, makes different assumptions about the data, provides different outputs, and, therefore, is useful in different situations. The use of splines was considered, but because the variables we were considering do not change gradually over space but frequently contain a large amount of local variation with sometimes sudden changes, splines were dismissed early as inappropriate here (Laslett 1994). The following are only highlights of four of the methods chosen as most appropriate, including some of the most distinctive features of each that affected its applicability here. The importance of each of these features depends on what we are interested in from the estimation.

Kriging estimates are essentially weighted moving averages of the original data values--taking the distance, direction, and redundancy of neighboring points into account using the model defined from the variogram. Ordinary kriging (OK) does honor the overall (global) mean and the sample data values. However, the averaging process results in output that is distinctly smoothed, and pockets of one or two high sample values can have a substantial effect on the surrounding estimates--undesirable characteristics for our objectives. For a type of error term, OK does report an estimation variance, but when the sample data are not multivariate normal, its use as a measure of estimate reliability is probably erroneous because it is dependent on the data location and redundancy of the sample points and variogram model type used in the estimate. Kriging estimation variance is not dependent on data values (Deutsch and Journel 1992).

An indicator transform divides the data into two classes--above or below a designated cutoff value. Indicator kriging (IK) provides an estimate of the probability, for each estimated cell, that it falls above or below that cutoff value. The output dataset thus indicated the probability that sugar maple occurred at each location. In this study, the cutoff chosen was 0%ba/acre, indicating the presence or absence of a particular species. Indicator kriging, unlike OK, does not assume that the datahave a particular distribution.

A fourth method investigated was sequential gaussian conditional simulation (sgCS) (Deutsch and Journel 1992; Rossi et al. 1993). Rather than determining a single "best" estimate, conditional simulation determines multiple estimates for each cell. All are equally probable and yet alternative realizations of the data determined from multiple simulation runs. This set of estimates can be used to build an entire distribution for each cell, that represents the range of possible values. A summary statistic such as the mean or median of this distribution can then be chosen and used as the modeled "estimate" of %ba/acre for that cell, and another value such as the standard deviation or inter-quartile range as the expression of uncertainty or "range of uncertainty." In this study, the median and a percentile range capturing approximately two-thirds of the distribution (17th and 83rd percentiles) were chosen as representative of each cell's distribution. Unlike OK, sgCS does honor in each realization both the sample histogram (and the univariate statistics of the sample data) and the sample variogram/covariance. In addition, spatial variability is an integral component, and sgCS is effective in preserving local spatial means. It also retains much more local variability, a characteristic of the resource at this scale that we were particularly interested in retaining. The disadvantage of sgCS is that it assumes that the sample data are univariate normal, bivariate normal, and higher moment normal. In this case, we could make the data univariate normal with a normal-score transform and check it against the behavior of a bivariate normal dataset. In the likely scenario that all criteria are not fully satisfied, as was assumed to be the case here, the user should be aware that the results may be affected (Rossi et al. 1993). To determine whether there was substantial bias in the results, we chose to check independently the charactistics of an individual realization against the characteristics of the sample data.

Results

The objective of this study was slightly unusual compared to applications in earth sciences in one respect--a dataset was desired not to answer a specific question (e.g., where is the best location for a sugarbush or a mill), but to provide a resource dataset(s) that could be called on for many uses--ideally, wherever spatial information about species distribution and occurrence is required. Thus, there are four datasets, or four pieces of data for each cell location, that have been chosen here as final output (Fig. 2). First, the results of the indicator kriging run, the probability that there is sugar maple present in a particular cell, is a valuable dataset. This method makes few requirements of the data and, therefore, outputs robust results. If a user is interested only in where sugar maple occurs, then the probability output from IK is the most useful, and, therefore, was chosen as one of the critical output datasets to be maintained here. The next three output datasets chosen, an estimate of %ba/acre and the plus and minus uncertainty associated with that estimate, address the interest in knowing how much sugar maple occurs. For this, the results of the conditional simulation method were chosen. Although its demands on the normality of the sample data were great, in this case there appeared to be little bias when the characteristics of the predicted estimates were compared with what was expected of them. Although this may not be true for all species, and would have to be checked in each case, this method is most desirable because of its capacity to retain the local spatial variation inherent in the sample data. The summary datasets created from the conditional simulation runs (sgCS-median, sgCS-17 and sgCS-83) provide both an estimate and a measure of uncertainty, retain the most local variability, and maintain the characteristics of the sample data fairly well. The median "estimate" of %ba/acre values alone did not maintain the FIA summary statistics at the state and unit level, but those statistics did fall within the calculated uncertainty--between the 17th and 83rd percentiles (capturing 66% of the estimate distribution at each cell). For this study and given the high level of variability in sugar maple values at this scale of sampling, this was considered as both reasonable and acceptable.

Dividing and modeling the data by ecoregion did reveal a different spatial structure in each. Adjacent ecosubregions had substantially different variograms, indicating the probability of different subpopulations of sugar maple in each, as least in terms of their spatial distribution patterns (Fig. 3). Given this evidence, future investigation and interpolation of individual tree species in the Northeast will be by ecosubregions.

The accuracy of the results from any model is important information for users of those data. How well does the output dataset portray the characteristics of the phenomena it was designed to capture? If it was intended to estimate current landcover as observed on the ground, how well does it do that? If it was intended to capture local variability, how well does it do that? Whether the resulting estimated dataset actually describes the occurrence and density of sugar maple involves two assumptions: 1) that the modeled dataset is representative of the sample data, and 2) that the sample data are representative of the phenomena. The first item can be checked by comparing the characteristics of the field data to those of the model predictions at as many scales as are appropriate given the data. The second item is determined largely by the sample design and our understanding of the phenomena, but we also can check the results against other independent sources of individual tree-species distribution--ideally of higher accuracy--where they exist. In this case, the lack of independent datasets of sufficient detail and describing the information we were trying to capture left us concentrating primarily on checking the first assumption. In this study, the univariate, variogram/covariance, and local statistics of each of the estimated datasets were compared to those of the original sample data (Fig. 4).

Discussion/Conclusions

The estimated datasets that are output by indicator kriging and sgCS have the potential to be very useful. Most tree species in the area examined exhibit substantial spatial dependence in the variograms of the normal-scored data, suggesting that there is considerable potential for the estimation of such datasets from FIA data. Each species exhibited some variety in spatial patterns, spatial dependence, and resulting uncertainty of the estimates, and each may therefore require different levels of additional fine tuning depending on the objectives of the specific analysis and the time and expertise available.

These techniques make explicit the uncertainties associated with an estimate in a form that can be incorporated when the data are used. This feature adds considerable utility and flexibility in the use of the resulting estimates, as the risk of errors of commission or omission can be specifically determined and manipulated to suit the current objectives.

The data exploration techniques used also proved invaluable in understanding the spatial characteristics of the species/variable being examined. Techniques included univariate analysis, variograms, and other spatial dependence analyses, and calculating local statistics. The resulting information was critical not only in determining the most suitable interpolation methods and for checking the sample data for errors, but also in understanding the characteristics of the sample data and thus the phenomena being investigated.

When the variables and the objectives do not require detailed refinement, the techniques used in this study are not extremely time consuming nor difficult to process, and could be extended to additional species, other stand attributes, and states as the current inventory data becomes available. There is a high level of variance associated with these estimates of %ba/acre--in many locations this variance can be as much as the estimate itself.. The dataset nevertheless provides a descriptive picture of species distribution at the state level. Compared to previous depictions of current species distribution from FIA data summarized at the county level, this method provides a much more detailed picture of species occurrence and distribution. Although estimates probably could be improved and variances diminished by additional investigation into each species, such as further dividing subpopulations or collecting additional data where the uncertainty is too high, the current estimates are informative and provide a useful basis from which to proceed.

These datasets of individual species distribution do not contain any of the fine-scale forest/nonforest detail. If such information is desired, more detailed datasets describing the forest/nonforest land cover would have to be derived from a more intense point sample or the continuous but averaged data available from satellite imagery. Such detailed datasets could be used as a "mask" and overlaid on any of the datasets of species distribution.

Maintaining information on individual species separately allows considerable flexibility in the use of species distribution data. Instead of being limited to previously defined fixed classes, forest cover types can be uniquely defined to capture more accurately the habitat required for a particular study. The potential also exists to use one or more of the species datasets as a decision layer in the interpretation of satellite imagery. The two datasets offer complementary information about the actual species composition on the ground.

There are more possibilities for applying geostatistical techniques than have been investigated here. For example, some species may exhibit a spatial correlation with a particular soil, climate, topography, or reflectance data from satellite imagery. Indicator kriging and sgCS, in particular, allow such ancillary or "soft" information to be incorporated into the estimation process.

References

Alerich, C.L. 1993. Forest Statistics for Pennsylvania--1978 and 1989. Resource Bulletin. NE-126. Radnor, PA. USDA Forest Service, Northeastern Forest Experiment Station.

Biondi, F., D.E. Myers, and C.C. Avery. 1994. Geostatistically modeling stem size and increment in an old growth forest. Canadian Journal of Forest Research. 24(7): 1354-1368.

Deutsch, C.V. and A.G. Journel. 1992. GSLIB: Geostatistical Software Library and User's Guide. Oxford University Press, New York.

Ghosh, M. and J.N.K. Rao. 1994. Small Area Estimation: An Appraisal. Statistical Sciences. 9(1):55-93.

Hansen M.H., T. Frieswyk, J.F. Glover, and J.F. Kelly. 1992. The Eastwide Forest Inventory Data Base: Users Manual. General Technical Report NC-151. USDA Forest Service, North Central Experiment Station. St. Paul, MN.

Hough, A.F. and R.D. Forbes. 1943. The ecology and silvics of forests in the High Plateaus of Pennsylvania. Ecological Monographs. 13:299-320.

Isaaks, E.H. and R.M. Srivastava. 1989. An Introduction to Applied Geostatistics. Oxford University Press, New York.

McNab, W.H. and P.E. Avers. 1984. Ecological Subregions of the United States: Section Descriptions. Administrative Publication WO-WSA-5. USDA Forest Service, Washington, DC.

Laslett, G.M. 1994. Kriging and splines: an empirical comparison of their predictive performance in some applications. Journal of the American Statistical Association, Applications and Case Studies. 89(426):391-400.

Powell, D.S. and T.J. Considine. 1982. An analysis of Pennsylvania's forest resources. Resource Bulletin. NE-69. Broomall, PA. USDA Forest Service, Northeastern Forest Experiment Station.

Riemann Hershey, R. 1996. Understanding the spatial distribution of tree species in Pennsylvania. pp. 73-82. In: Mowrer, H.T. et al. (tech coords). Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Second International Symposium. May 21-23. 1996. General Technical Report. RM-GTR-277. Fort Collins, CO. USDA Forest Service, Rocky Mountain Forest and Range Experiment Station.

Rossi, R.E., D.J. Mulla, A.G. Journel, and E.H. Franz. 1992. Geostatistical tools for modeling and interpreting ecological spatial dependence. Ecological Monographs. 62(2):277-314.

Rossi, R.E., P.W. Borth, and J.J. Tollefson. 1993. Stochastic simulation for characterizing ecological spatial patterns and appraising risk. Ecological Applications. 3(4):719-735.


[Back to RSB Page]