Data Preparation

This guide covers how to structure your data for webforest, handle missing values, and prepare data from common meta-analysis workflows.

Key Principle: Column Names, Not Values

webforest uses a column-mapping pattern: you specify which columns contain your data, not the values themselves. This makes it easy to use any data frame without renaming columns.

# Your data can have any column names
my_data <- data.frame(my_study = ..., my_or = ..., my_lo = ..., my_hi = ...)

# Just map them to the right arguments
forest_plot(my_data, point = "my_or", lower = "my_lo", upper = "my_hi", label = "my_study")

Required Columns

At minimum, forest_plot() needs four column mappings:

Argument	Description	Example
`point`	Point estimate (effect size)	Hazard ratio, odds ratio, mean difference
`lower`	Lower confidence interval bound	95% CI lower
`upper`	Upper confidence interval bound	95% CI upper
`label`	Row label text	Study name, subgroup

# Minimal example
data <- data.frame(
  study = c("Smith 2020", "Jones 2021", "Lee 2022"),
  hr = c(0.72, 0.85, 0.91),
  lo = c(0.55, 0.70, 0.75),
  hi = c(0.95, 1.03, 1.10)
)

forest_plot(data,
  point = "hr", lower = "lo", upper = "hi", label = "study",
  scale = "log", null_value = 1
)
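
Because the mappings are plain strings, a typo in a column name is easy to make. The sketch below uses a hypothetical helper, missing_mappings(), which is not part of webforest, to list any mapped names that are absent from the data frame before plotting. It relies only on base R.

```r
# Hypothetical helper (not a webforest function): report mapped column
# names that do not exist in the data frame.
missing_mappings <- function(data, ...) {
  mapped <- c(...)
  setdiff(mapped, names(data))
}

trial_data <- data.frame(
  study = c("Smith 2020", "Jones 2021"),
  hr = c(0.72, 0.85),
  lo = c(0.55, 0.70),
  hi = c(0.95, 1.03)
)

# All four mappings resolve, so nothing is reported
ok <- missing_mappings(trial_data,
  point = "hr", lower = "lo", upper = "hi", label = "study"
)

# A typo in one mapping ("low" instead of "lo") is caught before plotting
typo <- missing_mappings(trial_data,
  point = "hr", lower = "low", upper = "hi", label = "study"
)
print(typo)
```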

Optional Columns

Beyond the core four, you can map additional columns for styling and display:

Category	Columns	Purpose
Grouping	`group`	Hierarchical nesting
Row styling	`row_type`, `row_bold`, `row_indent`, `row_color`, `row_badge`	Per-row appearance
Cell styling	`style_cols`, `style_bold`, `style_color`, `style_bg`	Per-cell formatting
Display	Any column referenced in `columns = list(...)`	Extra data columns

Handling Missing Values (NA)

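webforest treats NA as meaningful rather than as an error: NA in the estimate columns marks a row with no interval to draw, and NA in a styling column means the default is used.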

Header and Spacer Rows

Rows with row_type = "header" or "spacer" typically have NA for effect estimates. The plot renders these as label-only rows without intervals:

structured <- data.frame(
  label = c("Primary Outcomes", "  CV Death", "  MI", "", "Secondary"),
  hr = c(NA, 0.82, 0.79, NA, NA),
  lower = c(NA, 0.72, 0.68, NA, NA),
  upper = c(NA, 0.94, 0.92, NA, NA),
  rtype = c("header", "data", "data", "spacer", "header"),
  rbold = c(TRUE, FALSE, FALSE, FALSE, TRUE)
)

forest_plot(structured,
  point = "hr", lower = "lower", upper = "upper", label = "label",
  row_type = "rtype", row_bold = "rbold",
  scale = "log", null_value = 1
)

Missing Effect Estimates

For data rows where an effect couldn’t be calculated, NA values display the label without plotting an interval. This is useful for subgroups with insufficient data.

Styling Column NAs

NA in styling columns (e.g., row_color, row_badge) means “use default” - no special styling is applied.
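
Since NA carries meaning in several places, it can help to confirm that the missing values sit where you expect. The following is a minimal base R sketch, assuming a row-type column named rtype as in the example above. It flags rows where only some of the three estimate values are missing, and header or spacer rows that carry estimates.

```r
structured <- data.frame(
  label = c("Primary Outcomes", "  CV Death", "  MI", "", "Secondary"),
  hr    = c(NA, 0.82, 0.79, NA, NA),
  lower = c(NA, 0.72, 0.68, NA, NA),
  upper = c(NA, 0.94, 0.92, NA, NA),
  rtype = c("header", "data", "data", "spacer", "header")
)

# Number of missing values among the three estimate columns, per row
n_missing <- rowSums(is.na(structured[, c("hr", "lower", "upper")]))

# Rows where only some of the three values are missing are usually mistakes
partial <- n_missing > 0 & n_missing < 3

# Header or spacer rows that carry estimates, which may be unintended
layout_with_values <- structured$rtype %in% c("header", "spacer") & n_missing < 3

print(data.frame(label = structured$label, partial, layout_with_values))
```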

Scale Considerations

Log Scale

Log Scale Requires Positive Values

When using scale = "log", all values in point, lower, and upper must be positive. Zero or negative values will cause rendering errors.

For example, a zero point estimate or a negative bound cannot be placed on a log axis:

# This will cause issues:
bad_data <- data.frame(
  study = "Problematic",
  or = 0,        # Zero breaks log scale
  lower = -0.1,  # Negative breaks log scale
  upper = 1.5
)

# Solution: filter or handle before plotting
# (your_data stands in for your own data frame; filter() comes from dplyr)
good_data <- your_data |>
  dplyr::filter(or > 0, lower > 0, upper > 0)

Typical null_value for log scale: 1 (ratio of 1 = no effect)
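
One caveat when filtering structured data: a comparison such as or > 0 evaluates to NA for header and spacer rows, and dplyr's filter() drops rows whose condition is NA. The base R sketch below keeps those layout rows while still removing non-positive estimates. The helper na_or_positive() is a hypothetical name used only for illustration.

```r
plot_data <- data.frame(
  label = c("Primary Outcomes", "Study A", "Study B", "Study C"),
  or    = c(NA, 0.80, 0.00, 1.20),
  lower = c(NA, 0.60, -0.10, 0.90),
  upper = c(NA, 1.05, 1.50, 1.60)
)

# TRUE when a value is either missing (layout row) or strictly positive
na_or_positive <- function(x) is.na(x) | x > 0

keep <- na_or_positive(plot_data$or) &
  na_or_positive(plot_data$lower) &
  na_or_positive(plot_data$upper)

# "Study B" is removed; the header row with NA estimates is kept
log_ready <- plot_data[keep, ]
print(log_ready$label)
```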

Linear Scale

Linear scale accepts any numeric values including negatives:

# Mean difference example (linear scale)
diff_data <- data.frame(
  comparison = c("Treatment A", "Treatment B", "Treatment C"),
  mean_diff = c(-2.5, 1.3, -0.8),
  lower = c(-4.1, -0.2, -2.1),
  upper = c(-0.9, 2.8, 0.5)
)

forest_plot(diff_data,
  point = "mean_diff", lower = "lower", upper = "upper",
  label = "comparison",
  scale = "linear", null_value = 0,
  axis_label = "Mean Difference (95% CI)"
)

Typical null_value for linear scale: 0 (difference of 0 = no effect)

Creating Grouping Columns

Single-Level Grouping

Use a categorical column to group rows:

trials <- data.frame(
  study = c("ADVANCE", "SPRINT", "ACCORD", "ONTARGET"),
  region = c("Europe", "North America", "North America", "Global"),
  hr = c(0.91, 0.75, 0.88, 0.94),
  lower = c(0.83, 0.64, 0.76, 0.86),
  upper = c(1.01, 0.87, 1.01, 1.02)
)

forest_plot(trials,
  point = "hr", lower = "lower", upper = "upper",
  label = "study", group = "region",
  scale = "log", null_value = 1
)

Hierarchical (Nested) Grouping

Pass multiple column names for nested subgroups:

nested <- data.frame(
  study = c("Site A", "Site B", "Site C", "Site D", "Site E", "Site F"),
  region = c("Americas", "Americas", "Americas", "Europe", "Europe", "Europe"),
  country = c("USA", "USA", "Brazil", "UK", "Germany", "Germany"),
  hr = c(0.72, 0.85, 0.79, 0.88, 0.91, 0.76),
  lower = c(0.58, 0.71, 0.62, 0.74, 0.78, 0.61),
  upper = c(0.89, 1.02, 1.01, 1.05, 1.06, 0.95)
)

forest_plot(nested,
  point = "hr", lower = "lower", upper = "upper",
  label = "study",
  group = c("region", "country"),  # Nested: region > country
  scale = "log", null_value = 1
)
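
Before passing nested grouping columns, it may be worth checking that the hierarchy is consistent, meaning each inner value belongs to exactly one outer value. The base R sketch below does that and also orders rows so each subgroup is contiguous. Whether webforest requires pre-sorted rows is not stated in this guide, so treat the sorting step as an optional precaution.

```r
nested <- data.frame(
  study = c("Site A", "Site B", "Site C", "Site D"),
  region = c("Americas", "Americas", "Europe", "Europe"),
  country = c("USA", "Brazil", "UK", "Germany")
)

# Count distinct regions per country; a clean hierarchy has exactly one each
regions_per_country <- tapply(
  nested$region, nested$country,
  function(x) length(unique(x))
)
inconsistent <- names(regions_per_country)[regions_per_country > 1]

# Order rows by outer level, then inner level
nested_sorted <- nested[order(nested$region, nested$country), ]
print(nested_sorted)
```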

Working with Meta-Analysis Results

From metafor

library(metafor)

# Run meta-analysis
res <- rma(yi = log_or, sei = se, data = studies, method = "REML")

# Convert to webforest format
forest_data <- studies |>
  mutate(
    or = exp(log_or),
    lower = exp(log_or - 1.96 * se),
    upper = exp(log_or + 1.96 * se)
  ) |>
  # Add pooled estimate as summary row
  bind_rows(
    tibble(
      study = "Pooled Estimate",
      or = exp(res$b[1]),  # res$b is a 1x1 matrix; take the scalar
      lower = exp(res$ci.lb),
      upper = exp(res$ci.ub),
      rtype = "summary",
      rbold = TRUE
    )
  )

forest_plot(forest_data,
  point = "or", lower = "lower", upper = "upper",
  label = "study",
  row_type = "rtype", row_bold = "rbold",
  scale = "log", null_value = 1
)
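
The back-transformation in the mutate() step above can be checked by hand without metafor. This base R sketch uses made-up numbers: it converts a log odds ratio and its standard error to the ratio scale, then recovers the standard error from the resulting interval, which is handy when a paper reports only the OR and its 95% CI.

```r
z <- qnorm(0.975)  # about 1.96 for a 95% interval

log_or <- log(0.80)
se <- 0.10

# Ratio-scale estimate and 95% CI
or <- exp(log_or)
lower <- exp(log_or - z * se)
upper <- exp(log_or + z * se)

# Reverse direction: recover the standard error from the 95% CI
se_recovered <- (log(upper) - log(lower)) / (2 * z)

print(round(c(or = or, lower = lower, upper = upper, se = se_recovered), 3))
```

The interval is symmetric around the estimate on the log scale but not on the ratio scale, which is why ratio measures are usually plotted with scale = "log".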

From meta Package

library(meta)

# Run meta-analysis
m <- metagen(TE = log_or, seTE = se, studlab = study, data = studies)

# Extract study-level data
forest_data <- tibble(
  study = m$studlab,
  or = exp(m$TE),
  lower = exp(m$lower),
  upper = exp(m$upper),
  weight = m$w.random / sum(m$w.random) * 100
) |>
  bind_rows(
    tibble(
      study = "Random Effects",
      or = exp(m$TE.random),
      lower = exp(m$lower.random),
      upper = exp(m$upper.random),
      rtype = "summary",
      rbold = TRUE
    )
  )

Common Data Transformations

Adding Row Types

# Transform flat data into structured layout
raw <- data.frame(
  outcome = c("CV Death", "MI", "Stroke"),
  category = c("Primary", "Primary", "Secondary"),
  hr = c(0.82, 0.79, 0.88),
  lower = c(0.72, 0.68, 0.74),
  upper = c(0.94, 0.92, 1.05)
)

structured <- raw |>
  group_by(category) |>
  group_modify(~ {
    header <- tibble(
      outcome = .y$category,
      hr = NA, lower = NA, upper = NA,
      rtype = "header", rbold = TRUE, rindent = 0
    )
    data <- .x |>
      mutate(
        outcome = paste0("  ", outcome),
        rtype = "data", rbold = FALSE, rindent = 1
      )
    bind_rows(header, data)
  }) |>
  ungroup()
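
The same idea extends to spacer rows. The sketch below is a base R variant, with no dplyr dependency, that builds a header row, the data rows, and a trailing spacer for each category, then drops the final spacer. The column name rtype follows the examples in this guide; the helper layout_row() is hypothetical.

```r
raw <- data.frame(
  outcome  = c("CV Death", "MI", "Stroke"),
  category = c("Primary", "Primary", "Secondary"),
  hr       = c(0.82, 0.79, 0.88),
  lower    = c(0.72, 0.68, 0.74),
  upper    = c(0.94, 0.92, 1.05)
)

# Hypothetical helper: a label-only row with NA estimates
layout_row <- function(text, type) {
  data.frame(
    outcome = text, hr = NA_real_, lower = NA_real_,
    upper = NA_real_, rtype = type
  )
}

# One block per category: header row, data rows, then a spacer row
blocks <- lapply(split(raw, raw$category), function(grp) {
  rows <- data.frame(grp[, c("outcome", "hr", "lower", "upper")], rtype = "data")
  rbind(layout_row(grp$category[1], "header"), rows, layout_row("", "spacer"))
})

with_spacers <- do.call(rbind, blocks)

# Drop the trailing spacer after the last group and reset row names
with_spacers <- with_spacers[-nrow(with_spacers), ]
rownames(with_spacers) <- NULL
print(with_spacers)
```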

Computing Weight Percentages

studies |>
  mutate(
    # Inverse-variance weight
    weight = 1 / se^2,
    weight_pct = weight / sum(weight) * 100
  )
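
A small worked example, using made-up standard errors, shows how the weights behave. Note that 1 / se^2 gives fixed-effect (common-effect) weights. Random-effects weights also include the between-study variance, which is why the meta example above reads them from m$w.random instead of computing them by hand.

```r
studies <- data.frame(
  study = c("A", "B", "C"),
  se = c(0.10, 0.20, 0.20)
)

# Inverse-variance weights: halving the standard error quadruples the weight
studies$weight <- 1 / studies$se^2                                 # 100, 25, 25
studies$weight_pct <- studies$weight / sum(studies$weight) * 100

print(round(studies$weight_pct, 1))
```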

Formatting Confidence Intervals

If you want a pre-formatted CI column for display:

data |>
  mutate(
    ci_text = sprintf("%.2f (%.2f-%.2f)", hr, lower, upper)
  )

Use col_text("ci_text", "HR (95% CI)") to display it, or use col_interval() for automatic formatting.
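
With structured data, sprintf() renders missing values as the text "NA", so header and spacer rows would show "NA (NA-NA)". The base R sketch below leaves the text empty for those rows instead. It assumes an empty string displays as a blank cell, which this guide does not confirm. It also uses "to" as the separator so that negative bounds on a linear scale stay readable.

```r
structured <- data.frame(
  label = c("Primary Outcomes", "CV Death", "MI"),
  hr    = c(NA, 0.82, 0.79),
  lower = c(NA, 0.72, 0.68),
  upper = c(NA, 0.94, 0.92)
)

# Empty text for rows without an estimate, formatted text otherwise
structured$ci_text <- ifelse(
  is.na(structured$hr),
  "",
  sprintf("%.2f (%.2f to %.2f)", structured$hr, structured$lower, structured$upper)
)

print(structured$ci_text)
```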

Data Validation Tips

1. Check for non-positive values before log scale: any(data$hr <= 0, na.rm = TRUE)
2. Verify CI ordering: all(data$lower <= data$hr & data$hr <= data$upper, na.rm = TRUE)
3. Check for character columns: sapply(data, class) - numeric columns shouldn’t be character
4. Preview structured data: Print the data frame to verify header/spacer row placement

# Quick validation function
validate_forest_data <- function(data, point, lower, upper, scale = "linear") {
  issues <- character()

  p <- data[[point]]
  l <- data[[lower]]
  u <- data[[upper]]

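  # Check only rows where all three values are present (skips header/spacer
  # rows and avoids an NA condition in the if() tests below)
  valid <- !is.na(p) & !is.na(l) & !is.na(u)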

  if (scale == "log" && any(p[valid] <= 0 | l[valid] <= 0 | u[valid] <= 0)) {
    issues <- c(issues, "Log scale requires all positive values")
  }

  if (any(l[valid] > p[valid] | p[valid] > u[valid])) {
    issues <- c(issues, "CI bounds should satisfy: lower <= point <= upper")
  }

  if (length(issues) == 0) {
    message("Data looks valid!")
  } else {
    warning(paste(issues, collapse = "\n"))
  }
}
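
For the character-column tip, estimates imported from spreadsheets or CSV files sometimes arrive as text. This base R sketch, using made-up data, converts such a column and reports which values failed to parse, so they can be fixed at the source rather than silently becoming NA.

```r
imported <- data.frame(
  study = c("Smith 2020", "Jones 2021", "Lee 2022"),
  hr = c("0.72", "0.85", "n/a"),
  stringsAsFactors = FALSE
)

# as.numeric() turns unparseable text into NA (with a warning), so compare
# missingness before and after to find the values that failed to convert
hr_num <- suppressWarnings(as.numeric(imported$hr))
failed <- is.na(hr_num) & !is.na(imported$hr)
print(imported$study[failed])

# Replace the text column with the numeric version
imported$hr <- hr_num
```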
