Robust analysis of compositional data

Matthias Templ
GEOSTAT 2016

Acknowledgement



  • Karel Hron and Peter Filzmoser for a long-term and fruitful cooperation

  • Karel Hron for providing his slides on compositional data analysis (used in some parts of the presentation)

Geostatistics is applied in (Wikipedia, 08.09.2016)

  • petroleum geology, hydrogeology, hydrology
  • meteorology, oceanography
  • geochemistry, geometallurgy
  • geography
  • forestry, environmental control, landscape ecology
  • soil science, agriculture

Compositional data (CoDa) are present in all of these fields

  • petroleum geology, hydrogeology, hydrology
  • meteorology, oceanography
  • geochemistry, geometallurgy
  • geography
  • forestry, environmental control, landscape ecology
  • soil science, agriculture

Example of compositional spatial and temporal data: proportions of land use/land types or forest fragmentation proportions in each grid cell, with potential covariates (these may include elevation range, road length, population, median household income, and housing levels).

What will you see/learn?

  • What are compositional data?
  • Real space vs the simplex, representation in coordinates
  • Examples
  • Applications in multivariate statistics using geochemical data
  • Why use robust methods?
  • The R package robCompositions

What are compositional data?

  • \( D \)-part vectors, describing quantitatively the parts of some whole, which carry exclusively relative information between the parts (Aitchison, 1986; Pawlowsky-Glahn et al., 2015)
  • Typical units of measurement: percentages, mg/kg, mg/l
  • Examples: geochemical data - proportions of minerals in a rock; concentrations of phenolic acids in wine (mg/l); household expenditures on various commodities (foodstuff, housing, clothing); forest fragmentation proportions; etc.
  • Compositional data consist of multivariate observations with positive values that sum up to a constant. Examples are proportional data or percentages, for which the values sum up row-wise to 1 or 100.
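
The constant-sum representation described above can be illustrated with a minimal sketch. The rescaling step is commonly called "closure" in the CoDa literature; the function name `closure` and the numbers below are hypothetical and serve only as an illustration.

```python
def closure(parts, total=1.0):
    """Rescale strictly positive parts so that they sum to `total`."""
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive")
    s = sum(parts)
    return [total * p / s for p in parts]


# Hypothetical raw values for three parts (illustrative only)
raw = [200.0, 300.0, 500.0]

proportions = closure(raw)              # row sums to 1
percentages = closure(raw, total=100)   # row sums to 100

print(proportions)
print(percentages)
```

Rescaling changes the row sum but not the ratios between the parts, which is why proportions and percentages describe the same composition.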

Still compositional data?

  • What if one or more variables of the multivariate data are not available or have not been measured?
  • What if rounding errors lead to a violation of the prescribed constraint?
  • Or what happens if the sum is not constant at all, but very different for each compositional observation?

The answer: it (always) depends on the analysis goals

The key (the new paradigm):



Compositional data are treated as multivariate data where relative rather than absolute information is relevant for the analysis.

  • Absolute information: refers to the original raw data, in their concrete units such as counts, monetary units, temperature, precipitation, etc.
  • Relative information: refers to a relative data representation, such as proportions or percentages, e.g. concentration of chemical elements in parts per million (ppm) or mg/kg, share of family income to gross household income, percentage of votes for a political party, daylight per day, etc.
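
As a small illustration of what "relative information" means, the sketch below checks that the pairwise log-ratios of an observation do not change when the observation is rescaled to percentages. The helper `log_ratios` and the data values are hypothetical.

```python
import math


def log_ratios(parts):
    """All pairwise log-ratios ln(x_i / x_j) for i < j."""
    d = len(parts)
    return [math.log(parts[i] / parts[j])
            for i in range(d) for j in range(i + 1, d)]


# Hypothetical absolute values (e.g. in monetary units); illustrative only
raw = [12.0, 30.0, 18.0]

# The same observation expressed as percentages of its total
scaled = [100.0 * x / sum(raw) for x in raw]

# The log-ratios agree up to floating-point error
same = all(math.isclose(a, b, abs_tol=1e-12)
           for a, b in zip(log_ratios(raw), log_ratios(scaled)))
print(same)
```

This is also why the sum constraint itself is not essential: observations with very different row sums carry the same relative information as long as their ratios agree.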

Generally,

  • relative information is analyzed by considering (log-)ratios between the variables.
  • representation of data in orthonormal/orthogonal coordinates
  • analysis on orthonormal/orthogonal coordinates and backtransformation to the original space
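
The three steps above can be sketched with one particular choice of orthonormal log-ratio coordinates, so-called pivot coordinates. This choice is an assumption made here for illustration (the slides do not fix a coordinate system at this point), and all function names are hypothetical.

```python
import math


def pivot_basis(d):
    """Rows are D-1 orthonormal vectors, each orthogonal to (1, ..., 1)."""
    basis = []
    for i in range(1, d):                       # i = 1, ..., D-1
        r = d - i                               # number of parts after part i
        row = [0.0] * (i - 1)                   # earlier parts do not enter
        row.append(math.sqrt(r / (r + 1)))      # weight of part i
        row.extend([-1.0 / math.sqrt(r * (r + 1))] * r)  # remaining parts
        basis.append(row)
    return basis


def to_coordinates(parts):
    """Map a D-part composition to D-1 real coordinates."""
    logs = [math.log(p) for p in parts]
    return [sum(v * l for v, l in zip(row, logs))
            for row in pivot_basis(len(parts))]


def from_coordinates(coords, total=1.0):
    """Back-transform coordinates to a composition summing to `total`."""
    d = len(coords) + 1
    basis = pivot_basis(d)
    # Centred log-ratio values as a linear combination of the basis vectors
    clr = [sum(coords[i] * basis[i][j] for i in range(d - 1))
           for j in range(d)]
    expd = [math.exp(c) for c in clr]
    s = sum(expd)
    return [total * e / s for e in expd]


x = [0.2, 0.3, 0.5]            # hypothetical 3-part composition
z = to_coordinates(x)          # 2 real coordinates
x_back = from_coordinates(z)   # recovers x up to floating-point error

print(z)
print(x_back)
```

Standard multivariate methods would be applied to `z`; results that live in coordinate space (for example a centre) can then be mapped back with `from_coordinates`.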

A NO GO:
Statistical analysis of compositional data using standard statistical methods that assume Euclidean geometry in real space is just wrong, yet it is typically applied in practice.

Example GEMAS data, univariate case

Absolute and relative concentrations of phosphorus (P) for samples from agricultural soils in Europe, measured by X-ray fluorescence (XRF).