---
title: "Data Preparation"
---
```{r}
#| include: false
library(webforest)
library(dplyr)
```
This guide covers how to structure your data for webforest, handle missing values, and prepare data from common meta-analysis workflows.
::: {.callout-tip}
## Key Principle: Column Names, Not Values
webforest uses a **column-mapping pattern**: you specify which columns contain your data, not the values themselves. This makes it easy to use any data frame without renaming columns.
```r
# Your data can have any column names
my_data <- data.frame(my_study = ..., my_or = ..., my_lo = ..., my_hi = ...)
# Just map them to the right arguments
forest_plot(my_data, point = "my_or", lower = "my_lo", upper = "my_hi", label = "my_study")
```
:::
## Required Columns
At minimum, `forest_plot()` needs four column mappings:
| Argument | Description | Example |
|----------|-------------|---------|
| `point` | Point estimate (effect size) | Hazard ratio, odds ratio, mean difference |
| `lower` | Lower confidence interval bound | 95% CI lower |
| `upper` | Upper confidence interval bound | 95% CI upper |
| `label` | Row label text | Study name, subgroup |
```{r}
# Minimal example
data <- data.frame(
  study = c("Smith 2020", "Jones 2021", "Lee 2022"),
  hr = c(0.72, 0.85, 0.91),
  lo = c(0.55, 0.70, 0.75),
  hi = c(0.95, 1.03, 1.10)
)
forest_plot(data,
  point = "hr", lower = "lo", upper = "hi", label = "study",
  scale = "log", null_value = 1
)
```
## Optional Columns
Beyond the core four, you can map additional columns for styling and display:
| Category | Columns | Purpose |
|----------|---------|---------|
| **Grouping** | `group` | Hierarchical nesting |
| **Row styling** | `row_type`, `row_bold`, `row_indent`, `row_color`, `row_badge` | Per-row appearance |
| **Cell styling** | `style_cols`, `style_bold`, `style_color`, `style_bg` | Per-cell formatting |
| **Display** | Any column referenced in `columns = list(...)` | Extra data columns |
## Handling Missing Values (NA)
webforest treats `NA` as meaningful rather than as an error. Its effect depends on the column it appears in:
### Header and Spacer Rows
Rows whose `row_type` column contains `"header"` or `"spacer"` typically have `NA` for the point estimate and both bounds. The plot renders these as label-only rows without intervals:
```{r}
structured <- data.frame(
  label = c("Primary Outcomes", " CV Death", " MI", "", "Secondary"),
  hr = c(NA, 0.82, 0.79, NA, NA),
  lower = c(NA, 0.72, 0.68, NA, NA),
  upper = c(NA, 0.94, 0.92, NA, NA),
  rtype = c("header", "data", "data", "spacer", "header"),
  rbold = c(TRUE, FALSE, FALSE, FALSE, TRUE)
)
forest_plot(structured,
  point = "hr", lower = "lower", upper = "upper", label = "label",
  row_type = "rtype", row_bold = "rbold",
  scale = "log", null_value = 1
)
```
### Missing Effect Estimates
For data rows where an effect couldn't be calculated, `NA` values display the label without plotting an interval. This is useful for subgroups with insufficient data.
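As a small sketch of this case, the hypothetical `subgroups` data frame below keeps a subgroup whose estimate could not be computed. The row carries `NA` in all three numeric columns so that its label still appears. The `has_estimate` helper column is an illustrative name, not a webforest argument.

```r
# Hypothetical subgroup results; "Age >= 75" had too few events to fit a model
subgroups <- data.frame(
  label = c("Age < 65", "Age 65-74", "Age >= 75"),
  hr    = c(0.81, 0.88, NA),
  lower = c(0.69, 0.73, NA),
  upper = c(0.95, 1.06, NA)
)

# Flag rows that have a complete estimate (point and both bounds)
subgroups$has_estimate <- complete.cases(subgroups[, c("hr", "lower", "upper")])

# Labels of the rows that will be shown without an interval
subgroups$label[!subgroups$has_estimate]
```

Checking `has_estimate` before plotting makes it easy to confirm that only the intended rows are missing an estimate.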
### Styling Column NAs
`NA` in styling columns (e.g., `row_color`, `row_badge`) means "use default" - no special styling is applied.
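One way to build such a column is with `ifelse()`, returning a value for the rows to highlight and `NA` elsewhere. The sketch below is illustrative only: the column names `rcolor` and `rbadge` are made up, and it assumes that `row_color` accepts hex colour strings and `row_badge` accepts short text, which this guide does not spell out.

```r
# Hypothetical results; column names are illustrative
results <- data.frame(
  study = c("Smith 2020", "Jones 2021", "Lee 2022"),
  hr    = c(0.72, 0.85, 0.91),
  lo    = c(0.55, 0.70, 0.75),
  hi    = c(0.95, 1.03, 1.10)
)

# TRUE where the interval excludes the null value of 1
excludes_null <- results$hi < 1 | results$lo > 1

# Style only those rows; NA leaves every other row at the default
results$rcolor <- ifelse(excludes_null, "#1b7837", NA_character_)
results$rbadge <- ifelse(excludes_null, "Significant", NA_character_)
```

The new columns would then be mapped with `row_color = "rcolor"` and `row_badge = "rbadge"`.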
## Scale Considerations
### Log Scale
::: {.callout-warning}
## Log Scale Requires Positive Values
When using `scale = "log"`, all values in `point`, `lower`, and `upper` must be **positive**. Zero or negative values will cause rendering errors.
:::
The example below shows values that break a log axis and one way to remove them before plotting:
```{r}
#| eval: false
# This will cause issues:
bad_data <- data.frame(
  study = "Problematic",
  or = 0,        # Zero breaks log scale
  lower = -0.1,  # Negative breaks log scale
  upper = 1.5
)
# Solution: filter or handle before plotting
# (`your_data` stands in for your own data frame)
good_data <- your_data |>
  filter(or > 0, lower > 0, upper > 0)
```
Typical `null_value` for log scale: `1` (ratio of 1 = no effect)
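If dropping rows is undesirable, an alternative is to blank out the offending estimates so that the row keeps its label but plots no interval, relying on the `NA` behaviour described earlier in this guide. The following is a minimal base R sketch with hypothetical data, not a webforest feature.

```r
# Hypothetical odds ratios; the second row has a zero point estimate
or_data <- data.frame(
  study = c("Trial A", "Trial B", "Trial C"),
  or    = c(0.80, 0.00, 1.20),
  lower = c(0.60, -0.10, 0.90),
  upper = c(1.05, 1.50, 1.60)
)

est_cols <- c("or", "lower", "upper")

# TRUE for rows where any of the three values is zero or negative
bad <- rowSums(or_data[est_cols] <= 0, na.rm = TRUE) > 0

# Blank out the whole estimate for those rows
or_data[bad, est_cols] <- NA
```

Blanking all three columns together avoids leaving a row with a point estimate but only one usable bound.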
### Linear Scale
Linear scale accepts any numeric values, including negatives:
```{r}
# Mean difference example (linear scale)
diff_data <- data.frame(
  comparison = c("Treatment A", "Treatment B", "Treatment C"),
  mean_diff = c(-2.5, 1.3, -0.8),
  lower = c(-4.1, -0.2, -2.1),
  upper = c(-0.9, 2.8, 0.5)
)
forest_plot(diff_data,
  point = "mean_diff", lower = "lower", upper = "upper",
  label = "comparison",
  scale = "linear", null_value = 0,
  axis_label = "Mean Difference (95% CI)"
)
```
Typical `null_value` for linear scale: `0` (difference of 0 = no effect)
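When a source reports a mean difference with a standard error instead of interval bounds, the bounds can be derived before plotting. The sketch below assumes a normal-approximation (Wald) 95% interval, $\text{estimate} \pm z_{0.975} \times \text{SE}$, and uses hypothetical numbers. Other interval types, such as t-based intervals, would need a different multiplier.

```r
# Hypothetical mean differences with standard errors
md <- data.frame(
  comparison = c("Treatment A", "Treatment B"),
  mean_diff  = c(-2.5, 1.3),
  se         = c(0.80, 0.75)
)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# Wald interval on the original (linear) scale
md$lower <- md$mean_diff - z * md$se
md$upper <- md$mean_diff + z * md$se
```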
## Creating Grouping Columns
### Single-Level Grouping
Use a categorical column to group rows:
```{r}
trials <- data.frame(
  study = c("ADVANCE", "SPRINT", "ACCORD", "ONTARGET"),
  region = c("Europe", "North America", "North America", "Global"),
  hr = c(0.91, 0.75, 0.88, 0.94),
  lower = c(0.83, 0.64, 0.76, 0.86),
  upper = c(1.01, 0.87, 1.01, 1.02)
)
forest_plot(trials,
  point = "hr", lower = "lower", upper = "upper",
  label = "study", group = "region",
  scale = "log", null_value = 1
)
```
### Hierarchical (Nested) Grouping
Pass multiple column names for nested subgroups:
```{r}
nested <- data.frame(
  study = c("Site A", "Site B", "Site C", "Site D", "Site E", "Site F"),
  region = c("Americas", "Americas", "Americas", "Europe", "Europe", "Europe"),
  country = c("USA", "USA", "Brazil", "UK", "Germany", "Germany"),
  hr = c(0.72, 0.85, 0.79, 0.88, 0.91, 0.76),
  lower = c(0.58, 0.71, 0.62, 0.74, 0.78, 0.61),
  upper = c(0.89, 1.02, 1.01, 1.05, 1.06, 0.95)
)
forest_plot(nested,
  point = "hr", lower = "lower", upper = "upper",
  label = "study",
  group = c("region", "country"), # Nested: region > country
  scale = "log", null_value = 1
)
```
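Nested grouping reads most naturally when each inner group belongs to exactly one outer group. Whether webforest enforces this is not stated here, so the sketch below is only a suggested pre-check in base R, using hypothetical data and helper names.

```r
# Hypothetical site-level data with two grouping levels
sites <- data.frame(
  study   = c("Site A", "Site B", "Site C", "Site D"),
  region  = c("Americas", "Americas", "Europe", "Europe"),
  country = c("USA", "Brazil", "UK", "Germany")
)

# Count how many distinct regions each country appears under
regions_per_country <- tapply(
  sites$region, sites$country,
  function(x) length(unique(x))
)

# Countries listed under more than one region (ideally none)
ambiguous <- names(regions_per_country)[regions_per_country > 1]
```

A non-empty `ambiguous` vector usually points to a typo or inconsistent coding in one of the grouping columns.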
## Working with Meta-Analysis Results
### From metafor
```{r}
#| eval: false
library(metafor)
# Run meta-analysis (`studies` is assumed to have columns study, log_or, se)
res <- rma(yi = log_or, sei = se, data = studies, method = "REML")
# Convert to webforest format
forest_data <- studies |>
  mutate(
    or = exp(log_or),
    lower = exp(log_or - 1.96 * se),
    upper = exp(log_or + 1.96 * se)
  ) |>
  # Add pooled estimate as summary row
  bind_rows(
    tibble(
      study = "Pooled Estimate",
      or = exp(as.numeric(res$b)), # res$b is a 1 x 1 matrix
      lower = exp(res$ci.lb),
      upper = exp(res$ci.ub),
      rtype = "summary",
      rbold = TRUE
    )
  )
forest_plot(forest_data,
  point = "or", lower = "lower", upper = "upper",
  label = "study",
  row_type = "rtype", row_bold = "rbold",
  scale = "log", null_value = 1
)
```
### From meta Package
```{r}
#| eval: false
library(meta)
# Run meta-analysis
m <- metagen(TE = log_or, seTE = se, studlab = study, data = studies)
# Extract study-level data
forest_data <- tibble(
  study = m$studlab,
  or = exp(m$TE),
  lower = exp(m$lower),
  upper = exp(m$upper),
  weight = m$w.random / sum(m$w.random) * 100
) |>
  bind_rows(
    tibble(
      study = "Random Effects",
      or = exp(m$TE.random),
      lower = exp(m$lower.random),
      upper = exp(m$upper.random),
      rtype = "summary",
      rbold = TRUE
    )
  )
```
## Common Data Transformations
### Adding Row Types
```{r}
# Transform flat data into structured layout
raw <- data.frame(
  outcome = c("CV Death", "MI", "Stroke"),
  category = c("Primary", "Primary", "Secondary"),
  hr = c(0.82, 0.79, 0.88),
  lower = c(0.72, 0.68, 0.74),
  upper = c(0.94, 0.92, 1.05)
)
structured <- raw |>
  group_by(category) |>
  group_modify(~ {
    header <- tibble(
      outcome = .y$category,
      hr = NA, lower = NA, upper = NA,
      rtype = "header", rbold = TRUE, rindent = 0
    )
    data <- .x |>
      mutate(
        outcome = paste0(" ", outcome),
        rtype = "data", rbold = FALSE, rindent = 1
      )
    bind_rows(header, data)
  }) |>
  ungroup()
```
### Computing Weight Percentages
```{r}
#| eval: false
studies |>
  mutate(
    # Inverse-variance weight
    weight = 1 / se^2,
    weight_pct = weight / sum(weight) * 100
  )
```
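To make the arithmetic concrete, the base R sketch below applies the same inverse-variance weights to three hypothetical studies and also derives the fixed-effect (common-effect) pooled estimate, $\hat\theta = \sum w_i y_i / \sum w_i$ with standard error $\sqrt{1 / \sum w_i}$. The data and the name `studies_demo` are made up for illustration.

```r
# Hypothetical log odds ratios and standard errors
studies_demo <- data.frame(
  study  = c("Smith 2020", "Jones 2021", "Lee 2022"),
  log_or = c(-0.33, -0.16, -0.09),
  se     = c(0.10, 0.20, 0.20)
)

# Inverse-variance weights and their percentage shares
w <- 1 / studies_demo$se^2
studies_demo$weight_pct <- w / sum(w) * 100

# Fixed-effect pooled estimate on the log scale, with its standard error
pooled_log_or <- sum(w * studies_demo$log_or) / sum(w)
pooled_se     <- sqrt(1 / sum(w))
```

These are fixed-effect weights. Random-effects weights also include the between-study variance, so they will generally differ from the values computed here.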
### Formatting Confidence Intervals
If you want a pre-formatted CI column for display:
```{r}
#| eval: false
data |>
  mutate(
    ci_text = sprintf("%.2f (%.2f-%.2f)", hr, lower, upper)
  )
```
Use `col_text("ci_text", "HR (95% CI)")` to display it, or use `col_interval()` for automatic formatting.
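In structured layouts, header and spacer rows have `NA` estimates, and `sprintf()` would format those as the literal text `NA`. A minimal sketch of one way around this, using hypothetical data, is to format only the rows that have an estimate. How webforest renders an empty string in a text column is an assumption here.

```r
# Hypothetical structured data with a header row (no estimate)
ci_demo <- data.frame(
  label = c("Primary Outcomes", "CV Death", "MI"),
  hr    = c(NA, 0.82, 0.79),
  lower = c(NA, 0.72, 0.68),
  upper = c(NA, 0.94, 0.92)
)

# Format only rows with an estimate; leave the others blank
ci_demo$ci_text <- ifelse(
  is.na(ci_demo$hr),
  "",
  sprintf("%.2f (%.2f-%.2f)", ci_demo$hr, ci_demo$lower, ci_demo$upper)
)
```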
## Data Validation Tips
1. **Check for non-positive values before log scale**: `any(data$hr <= 0, na.rm = TRUE)` (`na.rm = TRUE` skips header and spacer rows)
2. **Verify CI ordering**: `all(data$lower <= data$hr & data$hr <= data$upper, na.rm = TRUE)`
3. **Check for character columns**: `sapply(data, class)` - numeric columns shouldn't be character
4. **Preview structured data**: Print the data frame to verify header/spacer row placement
```{r}
# Quick validation function
validate_forest_data <- function(data, point, lower, upper, scale = "linear") {
  issues <- character()
  p <- data[[point]]
  l <- data[[lower]]
  u <- data[[upper]]
  # Check only rows with a complete estimate (skips header/spacer rows)
  valid <- !is.na(p) & !is.na(l) & !is.na(u)
  if (scale == "log" && any(p[valid] <= 0 | l[valid] <= 0 | u[valid] <= 0)) {
    issues <- c(issues, "Log scale requires all positive values")
  }
  if (any(l[valid] > p[valid] | p[valid] > u[valid])) {
    issues <- c(issues, "CI bounds should satisfy: lower <= point <= upper")
  }
  if (length(issues) == 0) {
    message("Data looks valid!")
  } else {
    warning(paste(issues, collapse = "\n"))
  }
  # Return the issues invisibly so callers can inspect them
  invisible(issues)
}
```
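The character-column check in tip 3 can also be scripted. The sketch below uses a hypothetical `imported` data frame whose estimate columns arrived as text, finds the non-numeric ones, and converts them with base R. The column names are illustrative.

```r
# Hypothetical import where the estimate columns arrived as text
imported <- data.frame(
  study = c("Smith 2020", "Jones 2021"),
  hr    = c("0.72", "0.85"),
  lo    = c("0.55", "0.70"),
  stringsAsFactors = FALSE
)

est_cols <- c("hr", "lo")

# Which estimate columns are not numeric?
not_numeric <- est_cols[!vapply(imported[est_cols], is.numeric, logical(1))]

# Convert them; values that cannot be parsed become NA (with a warning)
imported[not_numeric] <- lapply(imported[not_numeric], as.numeric)
```

Any value that turns into `NA` during conversion is worth inspecting, since it will otherwise be treated as a missing estimate.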