Tuesday, 18 July 2017

Missing Data Imputation Binary Options


Imputation strategies for missing binary outcomes in cluster randomized trials

Background

Attrition, which leads to missing data, is a common problem in cluster randomized trials (CRTs), where groups of patients rather than individuals are randomized. Standard multiple imputation (MI) strategies may not be appropriate for imputing missing data from CRTs because they assume independent data. In this paper, under the assumptions of missing completely at random and covariate dependent missing, we used a simulation study to compare six MI strategies that account for the intra-cluster correlation for missing binary outcomes in CRTs with the standard imputation strategies and the complete case analysis approach.

We considered three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies are the logistic regression method, the propensity score method, and the Markov chain Monte Carlo (MCMC) method, which apply standard MI strategies within each cluster. The three across-cluster MI strategies are the propensity score method, the random-effects (RE) logistic regression approach, and logistic regression with cluster as a fixed effect. Based on the Community Hypertension Assessment Trial (CHAT), which has complete data, we designed a simulation study to investigate the performance of the above MI strategies.

The estimated treatment effect and its 95% confidence interval (CI) from the generalized estimating equations (GEE) model based on the complete CHAT dataset are 1.14 (0.76, 1.70).
When 30% of the binary outcomes are missing completely at random, the simulation study shows that the estimated treatment effects and the corresponding 95% CIs from the GEE model are 1.15 (0.76, 1.75) if complete case analysis is used, 1.12 (0.72, 1.73) if the within-cluster MCMC method is used, 1.21 (0.80, 1.81) if across-cluster RE logistic regression is used, and 1.16 (0.82, 1.64) if standard logistic regression, which does not account for clustering, is used.

Conclusion

When the percentage of missing data is low or the intra-cluster correlation coefficient is small, different approaches for handling missing binary outcome data generate quite similar results. When the percentage of missing data is large, standard MI strategies, which do not take the intra-cluster correlation into account, underestimate the variance of the treatment effect. Within-cluster and across-cluster MI strategies (except for the random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, seem to be more appropriate for handling the missing outcome from CRTs. Under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar.

1. Introduction

Cluster randomized trials (CRTs), in which groups of participants rather than individuals are randomized, are increasingly used in health promotion and health services research [1]. When participants have to be managed within the same setting, such as a hospital, community, or family physician practice, this randomization strategy is usually adopted to minimize potential treatment contamination between intervention and control participants.
It is also used when individual-level randomization may be inappropriate, unethical, or infeasible [2]. The main consequence of the cluster-randomized design is that participants cannot be assumed to be independent, because participants from the same cluster tend to be similar. This similarity is quantified by the intra-cluster correlation coefficient (ICC). Considering the two components of the variation in the outcome, between-cluster and within-cluster variation, the ICC can be interpreted as the proportion of the total variation in the outcome that is explained by the between-cluster variation [3]. It can also be interpreted as the correlation between the outcomes of two participants in the same cluster. It is well established that failing to account for the intra-cluster correlation in the analysis can increase the chance of obtaining statistically significant but spurious findings [4].

The risk of attrition can be very high in some CRTs because of the lack of direct contact with individual participants and lengthy follow-up. In addition to missing individuals, entire clusters may be missing, which further complicates the handling of missing data in CRTs. The effect of missing data on the results of the statistical analysis depends on the mechanism that caused the data to be missing and on the way the missing data are handled. The default approach to this problem is complete case analysis (also called listwise deletion), that is, excluding participants with missing data from the analysis. Although this approach is easy to use and is the default option in most statistical packages, it can substantially weaken the statistical power of the trial and can also lead to biased results, depending on the missing data mechanism.
In general, the nature or type of missingness falls into four categories: missing completely at random (MCAR), missing at random (MAR), covariate dependent (CD) missing, and missing not at random (MNAR) [6]. Understanding these categories is important because the solutions may vary with the nature of the missingness. MCAR means that the missing data mechanism, i.e., the probability of missingness, does not depend on the observed or unobserved data. Both the MAR and CD mechanisms indicate that the causes of missing data are unrelated to the missing values but may be related to the observed values. In the context of longitudinal data, where serial measurements are taken for each individual, MAR means that the probability of a missing response at a particular visit depends on either observed responses at previous visits or covariates, whereas CD missing, a special case of MAR, means that the probability of a missing response depends only on covariates. MNAR means that the probability of missing data depends on the unobserved data. It commonly occurs when people drop out of the study because of poor or good health outcomes. A key distinction between these categories is that MNAR is non-ignorable, whereas the other three categories (i.e., MCAR, CD, or MAR) are ignorable [7]. When missingness is ignorable, imputation strategies such as mean imputation, hot deck, last observation carried forward, or multiple imputation (MI), which substitute each missing value with one or more plausible values, can produce a complete dataset that is not adversely biased [8, 9]. Non-ignorable missing data are more challenging and require a different approach [10]. Two main approaches to handling missing outcomes are likelihood-based analyses and imputation [10].
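To make the difference between MCAR and CD missingness concrete, the following minimal sketch generates both kinds of missingness indicators for simulated patients. The covariate names, the base rates, and the multipliers of 1.5 are hypothetical values chosen only for illustration; they are not taken from the study.

```python
import random

def mcar_mask(n, rate, rng):
    """MCAR: every outcome is missing with the same probability."""
    return [rng.random() < rate for _ in range(n)]

def cd_mask(covariates, base_rate, multipliers, rng):
    """CD missing: the probability depends only on observed covariates.

    `multipliers` maps a binary covariate name to the factor applied to the
    base probability when that covariate equals 1 (hypothetical values).
    """
    mask = []
    for row in covariates:
        p = base_rate
        for name, factor in multipliers.items():
            if row[name] == 1:
                p *= factor
        mask.append(rng.random() < min(p, 1.0))
    return mask

def missing_rate(mask, rows, name, value):
    """Share of missing outcomes among rows where rows[name] == value."""
    flags = [m for m, r in zip(mask, rows) if r[name] == value]
    return sum(flags) / len(flags)

rng = random.Random(2017)
n = 20000
# Simulated patients with two binary covariates (illustrative only).
patients = [{"male": rng.randint(0, 1), "control": rng.randint(0, 1)}
            for _ in range(n)]

mcar = mcar_mask(n, 0.20, rng)
cd = cd_mask(patients, 0.15, {"male": 1.5, "control": 1.5}, rng)

# Under MCAR the missing rate should not differ by sex; under CD it should.
mcar_gap = missing_rate(mcar, patients, "male", 1) - missing_rate(mcar, patients, "male", 0)
cd_gap = missing_rate(cd, patients, "male", 1) - missing_rate(cd, patients, "male", 0)
print(f"MCAR gap by sex: {mcar_gap:+.3f}; CD gap by sex: {cd_gap:+.3f}")
```

Neither mask looks at the outcome itself, which is what keeps both mechanisms ignorable; an MNAR mask would have to depend on the unobserved outcome value.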
In this paper, we focus on MI strategies, which take into account the variability or uncertainty of the missing data, to impute the missing binary outcome in CRTs. Under the assumption of MAR, MI strategies replace each missing value with a set of plausible values to create multiple imputed datasets, usually varying in number from 3 to 10 [11]. These multiple imputed datasets are analysed using standard procedures for complete data. Results from the imputed datasets are then combined for inference to generate the final result. Standard MI procedures are available in many standard statistical software packages such as SAS (Cary, NC), SPSS (Chicago, IL), and STATA (College Station, TX). However, these procedures assume that observations are independent and may not be suitable for CRTs, since they do not account for the intra-cluster correlation.

To the best of our knowledge, limited investigation has been done on imputation strategies for missing binary or categorical outcomes in CRTs. Yi and Cook reported marginal methods for missing longitudinal data arising from clustered designs [12]. Hunsberger et al. [13] described three strategies for continuous missing data in CRTs: 1) a multiple imputation procedure in which the missing values are replaced with values resampled from the observed data; 2) a median procedure based on the Wilcoxon rank sum test, which assigns the missing data in the intervention group the worst ranks; 3) a multiple imputation procedure in which the missing values are replaced by the predicted values from a regression equation. Nixon et al. [14] presented strategies for imputing missing endpoints from a surrogate.
In the analysis of a continuous outcome from the Community Intervention Trial for Smoking Cessation (COMMIT), Green et al. stratified individual participants into groups that were more homogeneous with respect to the predicted outcome. Within each stratum, they imputed the missing outcome using the observed data [15, 16]. Taljaard et al. [17] compared several different imputation strategies for missing continuous outcomes in CRTs under the assumption of missing completely at random. These strategies include cluster mean imputation, within-cluster MI using the Approximate Bayesian Bootstrap (ABB) method, pooled MI using the ABB method, standard regression MI, and mixed-effects regression MI. As pointed out by Kenward, if a substantive model that reflects the data structure, such as a general linear mixed model, is to be used, it is important that the imputation model also reflects this structure [18].

The objectives of this paper are to: i) investigate the performance of various imputation strategies for missing binary outcomes in CRTs under different percentages of missingness, assuming a missing completely at random or covariate dependent missing mechanism; ii) compare the agreement between the complete dataset and the imputed datasets obtained from different imputation strategies; iii) compare the robustness of the results under two commonly used statistical analysis methods, the generalized estimating equations (GEE) and random-effects (RE) logistic regression, under different imputation strategies.

2. Methods

In this paper, we consider three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs.
The three within-cluster MI strategies are the logistic regression method, the propensity score method, and the MCMC method, which are standard MI strategies carried out within each cluster. The three across-cluster MI strategies are the propensity score method, the random-effects logistic regression method, and logistic regression with cluster as a fixed effect. Based on the complete dataset from the Community Hypertension Assessment Trial (CHAT), we conducted a simulation study to investigate the performance of the above MI strategies. We used kappa statistics to compare the agreement between the imputed datasets and the complete dataset. We also used the estimated treatment effects from the GEE and RE logistic regression models [19] to assess the robustness of the results under different percentages of missing binary outcome under the MCAR and CD missing assumptions.

2.1. Complete case analysis

In this approach, only patients with complete data are included in the analysis, while patients with missing data are excluded. When the data are MCAR, the complete case analysis approach, using either a likelihood-based analysis such as RE logistic regression or a marginal model such as the GEE approach, is valid for analysing binary outcomes from CRTs, since the missing data mechanism is independent of the outcome. When the data are CD missing, both the RE logistic regression and the GEE approach are valid if the known covariates associated with the missing data mechanism are adjusted for. It can be implemented using the GENMOD and NLMIXED procedures in SAS.
2.2. Standard multiple imputation

Assuming that the observations are independent, we can apply the standard MI procedures provided by standard statistical software such as SAS. Three widely used MI methods are the predictive model method (the logistic regression method for binary data), the propensity score method, and the MCMC method [20]. In general, both the propensity score method and the MCMC method are recommended for the imputation of continuous variables [21].

A dataset is said to have a monotone missing pattern when the fact that a measurement $Y_j$ is missing for an individual implies that all subsequent measurements $Y_k$, $k > j$, are also missing for that individual. When the data are missing in a monotone pattern, any of the parametric predictive model, the nonparametric method that uses propensity scores, or the MCMC method is appropriate [21]. For an arbitrary missing data pattern, an MCMC method that assumes multivariate normality can be used. These MI strategies are implemented with the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS, separately for each intervention group.

2.2.1. Logistic regression method

In this approach, a logistic regression model is fitted using the observed outcome and the covariates. Based on the parameter estimates and the associated covariance matrix, the posterior predictive distribution of the parameters can be constructed. A new logistic regression model is then simulated from the posterior predictive distribution of the parameters and is used to impute the missing values.

2.2.2. Propensity score method

The propensity score is the conditional probability of being missing given the observed data. It can be estimated by means of a logistic regression model with a binary outcome indicating whether or not the data are missing.
The observations are then stratified into a number of strata based on these propensity scores. The ABB procedure [22] is then applied to each stratum. The ABB imputation first draws with replacement from the observed data to create a new dataset, which is a nonparametric analogue of drawing parameters from the posterior predictive distribution of the parameters, and then randomly draws imputed values with replacement from the new dataset.

2.2.3. Markov chain Monte Carlo method

With the MCMC method, pseudo-random samples are drawn from a target probability distribution [21]. When the missing data have a non-monotone pattern, the target distribution is the joint conditional distribution of $Y_{mis}$ and $\theta$ given $Y_{obs}$, where $Y_{mis}$ and $Y_{obs}$ represent the missing and observed data, respectively, and $\theta$ represents the unknown parameters. The MCMC method is carried out as follows: replace $Y_{mis}$ with some assumed values, then simulate $\theta$ from the resulting complete-data posterior distribution $P(\theta \mid Y_{obs}, Y_{mis})$. Let $\theta^{(t)}$ be the current simulated value of $\theta$. Then $Y_{mis}^{(t+1)}$ can be drawn from the conditional predictive distribution

$$Y_{mis}^{(t+1)} \sim P(Y_{mis} \mid Y_{obs}, \theta^{(t)}).$$

Conditioning on $Y_{mis}^{(t+1)}$, the next simulated value of $\theta$ can be drawn from its complete-data posterior distribution

$$\theta^{(t+1)} \sim P(\theta \mid Y_{obs}, Y_{mis}^{(t+1)}).$$

By repeating the above procedure, we can generate a Markov chain that converges in distribution to $P(\theta, Y_{mis} \mid Y_{obs})$. This method is attractive because it avoids complicated analytical calculation of the posterior distribution of $\theta$ and $Y_{mis}$. However, convergence in distribution is an issue that researchers need to address. In addition, this method is based on the assumption of multivariate normality. When it is used to impute binary variables, the imputed values can be any real values.
Most of the imputed values are between 0 and 1, but some are outside this range. We round the imputed values to 0 if they are less than 0.5 and to 1 otherwise. This multiple imputation method is implemented with the MI procedure in SAS. We use a single chain and non-informative priors for all imputations, and the expectation-maximization (EM) algorithm to find maximum likelihood estimates in parametric models for incomplete data and to derive parameter estimates from a posterior mode. The iterations are considered to have converged when the change in the parameter estimates between iteration steps is less than 0.0001 for each parameter.

2.3. Within-cluster multiple imputation

Standard MI strategies are inappropriate for handling missing data from CRTs because of the assumption of independent observations. For the within-cluster imputation, we carry out standard MI, as described above, using the logistic regression method, the propensity score method, and the MCMC method separately for each cluster. Thus, the missing values are imputed based on the observed data within the same cluster as the missing values. Given that subjects within the same cluster are more likely to be similar to each other than to those from different clusters, within-cluster imputation can be seen as a strategy that imputes the missing values in a way that accounts for the intra-cluster correlation. These MI strategies are implemented with the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

2.4. Across-cluster multiple imputation

2.4.1. Propensity score method

Compared with standard multiple imputation using the propensity score method, we added cluster as one of the covariates to obtain the propensity score for each observation. Consequently, patients within the same cluster are more likely to be categorized into the same propensity score stratum.
Therefore, the intra-cluster correlation is taken into account when the ABB procedure is applied within each stratum to generate the imputed values for the missing data. This multiple imputation strategy is implemented with the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

2.4.2. Random-effects logistic regression

Compared with the predictive model using the standard logistic regression method, we assume that the binary outcome is modelled by the random-effects logistic model:

$$\operatorname{logit}(\Pr(Y_{ijl} = 1)) = X_{ijl}\beta + U_{ij},$$

where $Y_{ijl}$ is the binary outcome of patient $l$ in cluster $j$ in intervention group $i$; $X_{ijl}$ is the matrix of fully observed individual-level or cluster-level covariates; $U_{ij} \sim N(0, \sigma_B^2)$ represents the cluster-level random effect; and $\sigma_B^2$ represents the between-cluster variance. $\sigma_B^2$ can be estimated when fitting the random-effects logistic regression model using the observed outcome and covariates. The MI strategy using the random-effects logistic regression method obtains the imputed values in three steps:

(1) Fit a random-effects logistic regression model as described above using the observed outcome and covariates.
(2) Based on the estimates of $\beta$ and $\sigma_B$ obtained from step (1) and the associated covariance matrix, construct the posterior predictive distribution of these parameters.
(3) Fit a new random-effects logistic regression using the simulated parameters from the posterior predictive distribution and the observed covariates to obtain the imputed missing outcome.

The MI strategy using random-effects logistic regression takes into account the between-cluster variance, which is ignored in the MI strategy using standard logistic regression, and may therefore be valid for imputing missing binary data in CRTs.
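The parameter-drawing and imputation steps of the random-effects approach can be illustrated with a small sketch. It is not the authors' SAS implementation. It assumes that the model has already been fitted elsewhere, treats the parameter estimates as independent normal draws (ignoring off-diagonal covariance terms), and draws each cluster effect from $N(0, \sigma_B^2)$ rather than conditioning on that cluster's observed outcomes. The function name, the data layout, and all numeric values are hypothetical.

```python
import math
import random

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def impute_re_logistic(rows, beta_hat, beta_se, sigma_hat, sigma_se, rng):
    """Return one imputed copy of `rows`.

    Each row is a dict with keys "cluster", "x" (covariate list whose first
    entry is 1.0 for the intercept) and "y" (0/1, or None when missing).
    """
    # Simplified parameter draw: independent normals around the estimates.
    beta = [rng.gauss(b, se) for b, se in zip(beta_hat, beta_se)]
    sigma_b = abs(rng.gauss(sigma_hat, sigma_se))
    # One random effect per cluster, shared by all patients in that cluster.
    clusters = sorted({r["cluster"] for r in rows})
    u = {c: rng.gauss(0.0, sigma_b) for c in clusters}
    imputed = []
    for r in rows:
        y = r["y"]
        if y is None:
            eta = sum(b * x for b, x in zip(beta, r["x"])) + u[r["cluster"]]
            y = 1 if rng.random() < expit(eta) else 0
        imputed.append({**r, "y": y})
    return imputed

rng = random.Random(7)
# Toy data: 4 clusters of 10 patients, about 30% of outcomes set to missing.
rows = [
    {"cluster": c, "x": [1.0, float(t)],
     "y": None if rng.random() < 0.3 else rng.randint(0, 1)}
    for c, t in [(1, 1), (2, 1), (3, 0), (4, 0)]
    for _ in range(10)
]
# m = 5 imputed datasets, each based on a fresh parameter draw.
imputations = [
    impute_re_logistic(rows, [0.2, 0.1], [0.15, 0.2], 0.5, 0.1, rng)
    for _ in range(5)
]
```

Because every patient in a cluster shares the same drawn effect, imputed outcomes within a cluster are correlated, which is the property the standard logistic regression imputation lacks.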
We provide the SAS code for this method in Appendix A.

2.4.3. Logistic regression with cluster as a fixed effect

Compared with the predictive model using the standard logistic regression method, we add cluster as a fixed effect to account for the clustering effect. This multiple imputation strategy is implemented with the MI, MIANALYZE, GENMOD, and NLMIXED procedures in SAS.

3. Simulation study

3.1. Community Hypertension Assessment Trial

The CHAT study has been reported in detail elsewhere [23]. In brief, it was a cluster randomized controlled trial that aimed to evaluate the effectiveness of pharmacy-based blood pressure (BP) clinics led by peer health educators, with feedback to family physicians (FPs), on the management and monitoring of BP among patients 65 years or older. The FP was the unit of randomization. Patients from the same FP received the same intervention. In total, 28 FPs participated in the study. Fourteen were randomly allocated to the intervention (pharmacy BP clinics) and 14 to the control group (no BP clinics offered). Fifty-five patients were randomly selected from each FP roster. Therefore, 1540 patients participated in the study. All eligible patients in both the intervention and control groups received usual health care at their FP's office. Patients in the practices allocated to the intervention group were invited to attend the community BP clinics. Peer health educators assisted patients in measuring their BP and reviewing their cardiovascular risk factors. Research nurses conducted the baseline and end-of-trial (12 months after randomization) audits of the health records of the 1540 patients who participated in the study. The primary outcome of the CHAT study was a binary outcome indicating whether or not the patient's BP was controlled at the end of the trial.
A patient's BP was considered controlled if, at the end of the trial, the patient met the targets of systolic BP 140 mmHg and diastolic BP 90 mmHg for patients without diabetes or target organ damage, or systolic BP 130 mmHg and diastolic BP 80 mmHg for patients with diabetes or target organ damage. In addition to the intervention group, the other predictors considered in this paper were age (continuous variable), sex (binary variable), diabetes at baseline (binary variable), heart disease at baseline (binary variable), and whether the patient's BP was controlled at baseline (binary variable). At the end of the trial, 55% of patients had their BP controlled. Without including any other predictors in the model, the treatment effects and their 95% confidence intervals (CIs) estimated from the GEE and RE models were 1.14 (0.72, 1.80) and 1.10 (0.65, 1.86), respectively. The estimated ICC was 0.077. After adjusting for the above variables, the treatment effects and their CIs from the GEE and RE models were 1.14 (0.76, 1.70) and 1.12 (0.72, 1.76), respectively. The estimated ICC was 0.055.

Since there are no missing data in the CHAT dataset, it offers a convenient platform for designing a simulation study to compare the imputed and observed values and to investigate the performance of the different multiple imputation strategies under various missing data mechanisms and percentages of missingness.

3.2. Generating datasets with missing binary outcomes

Using the CHAT study dataset, we investigated the performance of different MI strategies for missing binary outcomes under the MCAR and CD mechanisms. Under the MCAR assumption, we generated datasets with a given percentage of missingness in the binary outcome, which indicates whether or not the BP was controlled at the end of the trial for each patient. The probability of being missing for each patient was completely random, i.e., the probability of missingness did not depend on any observed or unobserved CHAT data. Under the CD missing assumption, we considered sex, treatment group, and whether or not the patient's BP was controlled at baseline, which are commonly associated with drop-out in clinical trials and observational studies [24-26], to be associated with the probability of missingness. We further assumed that male patients were 1.2 times more likely to have a missing outcome; patients allocated to the control group were 1.3 times more likely to have a missing outcome; and patients whose BP was not controlled at baseline were 1.4 times more likely to have a missing outcome than patients whose BP was controlled at baseline.

3.3. Design of the simulation study

First, we compared the agreement between the values of the imputed outcome variable and the true values of the outcome variable using kappa statistics. The kappa statistic is the most commonly used statistic for assessing agreement between two observers or methods, and it takes into account the fact that they will sometimes agree or disagree simply by chance [27]. It is calculated from the difference between how much agreement is actually present and how much agreement would be expected by chance alone. A kappa of 1 indicates perfect agreement, and 0 indicates agreement equivalent to chance. The kappa statistic has been widely used by researchers to evaluate the performance of different imputation techniques for imputing missing categorical data [29].
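As an illustration of how the kappa statistic compares imputed and true binary outcomes, here is a minimal sketch of Cohen's kappa for two 0/1 sequences. The function name and the ten example values are hypothetical and are not data from the study.

```python
def cohen_kappa(true_values, imputed_values):
    """Cohen's kappa for two equally long sequences of 0/1 values."""
    n = len(true_values)
    # Observed agreement: share of positions where both sequences match.
    observed = sum(t == i for t, i in zip(true_values, imputed_values)) / n
    # Chance agreement from the marginal proportions of 1s and 0s.
    p_true = sum(true_values) / n
    p_imputed = sum(imputed_values) / n
    expected = p_true * p_imputed + (1 - p_true) * (1 - p_imputed)
    # Undefined (division by zero) when chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)

true_outcome = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
imputed_outcome = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
kappa = cohen_kappa(true_outcome, imputed_outcome)
print(round(kappa, 3))  # observed 0.9, expected 0.5, so kappa is about 0.8
```

In the simulation setting, only the positions that were set to missing can disagree, so kappa computed over the full outcome vector naturally falls as the percentage of missingness rises.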
Second, under MCAR and CD missingness, we compared the treatment effect estimates from the RE and GEE methods under the following scenarios: 1) exclude the missing values from the analysis, i.e., complete case analysis; 2) apply standard multiple imputation strategies, which do not take the intra-cluster correlation into account; 3) apply the within-cluster imputation strategies; and 4) apply the across-cluster imputation strategies.

We designed the simulation study according to the following steps.

1) Generated 5%, 10%, 15%, 20%, 30%, and 50% missing outcomes under both the MCAR and CD missing assumptions. These amounts of missingness were chosen to cover the range of possible missingness in practice.
2) Applied the above multiple imputation strategies to generate m = 5 datasets. According to Rubin, the relative efficiency of MI does not increase much when generating more than 5 imputed datasets [11].
3) Calculated the kappa statistic to assess the agreement between the values of the imputed outcome variable and the true values of the outcome variable.
4) Obtained the single treatment effect estimate by combining the effect estimates from the 5 imputed datasets using the GEE and RE models.
5) Repeated the above four steps 1000 times, i.e., took 1000 simulation runs.
6) Calculated the overall kappa statistic by averaging the kappa statistics from the 1000 simulation runs.
7) Calculated the overall treatment effect and its standard error by averaging the treatment effects and their standard errors from the 1000 simulation runs.

4. Results

4.1. Results when data are missing completely at random

With 5%, 10%, 15%, 20%, 30%, or 50% missingness under the MCAR assumption, the estimated kappa for all the different imputation strategies is slightly above 0.95, 0.90, 0.85, 0.80, 0.70, and 0.50, respectively.
The estimated kappa for the different imputation strategies at different percentages of missing outcomes under the MCAR assumption is presented in detail in Table 1.

Table 1. Kappa statistics for different imputation strategies when missingness is completely at random.

Figure. Treatment effect estimated from random-effects logistic regression when 30% of the data are covariate dependent missing.

5. Discussion

In this paper, under the assumptions of MCAR and CD missing, we used a simulation study to compare six MI strategies that account for the intra-cluster correlation for missing binary outcomes in CRTs with the standard imputation strategies and the complete case analysis approach. Our results show that, first, when the percentage of missing data is low or the intra-cluster correlation coefficient is small, different imputation strategies or the complete case analysis approach generate quite similar results. Second, standard MI strategies, which do not take the intra-cluster correlation into account, underestimate the variance of the treatment effects. Therefore, they may lead to statistically significant but spurious conclusions when used to deal with missing data from CRTs. Third, under the assumptions of MCAR and CD missing, the point estimates (OR) are quite similar across the different approaches to handling the missing data, except for the random-effects logistic regression MI strategy. Fourth, both the within-cluster and across-cluster MI strategies take the intra-cluster correlation into account and provide much more conservative treatment effect estimates than MI strategies that ignore the clustering effect. Fifth, within-cluster imputation strategies lead to wider CIs than across-cluster imputation strategies, especially when the percentage of missingness is high.
This may be because within-cluster imputation strategies use only a fraction of the data, which leads to large variation in the estimated treatment effect. Sixth, a larger estimated kappa, which indicates higher agreement between the imputed values and the observed values, is associated with better performance of the MI strategies in terms of producing an estimated treatment effect and 95% CI closer to those obtained from the complete CHAT dataset. Seventh, under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar.

To the best of our knowledge, limited work has been done on comparing different multiple imputation strategies for missing binary outcomes in CRTs. Taljaard et al. compared four MI strategies (pooled ABB, within-cluster ABB, standard regression, mixed-effects regression) for missing continuous outcomes in CRTs when data are missing completely at random. Their findings are similar to ours.

It should be noted that within-cluster MI strategies may only be applicable when the cluster size is sufficiently large and the percentage of missingness is relatively small. In the CHAT study, there were 55 patients in each cluster, which provided enough data to carry out the within-cluster imputation strategies using the propensity score and MCMC methods. However, the logistic regression method failed when the percentage of missingness was high. This was because, when generating a large percentage (20%) of missing outcomes, all patients with a binary outcome of 0 were simulated as missing for some clusters. Therefore, the logistic regression model failed for these clusters. In addition, our results show that the complete case analysis approach performs relatively well, even with 50% missing.
We think that, because of the intra-cluster correlation, one would not expect the missing values to have much impact as long as a large portion of each cluster is still present. However, further investigation of this issue using a simulation study would be helpful to answer this question.

Our results show that the across-cluster random-effects logistic regression strategy leads to a potentially biased estimate, especially when the percentage of missingness is high. As described in Section 2.4.2, we assume that the cluster-level random effects follow a normal distribution, i.e. $U_{ij} \sim N(0, \sigma_B^2)$. Researchers have shown that misspecification of the distributional shape has little impact on inferences about the fixed effects [31]. Incorrectly assuming that the random-effects distribution is independent of the cluster size may affect inferences about the intercept, but does not seriously affect inferences about the regression parameters. However, incorrectly assuming that the random-effects distribution is independent of covariates may seriously affect inferences about the regression parameters [32, 33]. In our dataset, the mean or the variance of the random-effects distribution could be associated with a covariate, which might explain the potential bias from the across-cluster random-effects logistic regression strategy. In contrast, the imputation strategy of logistic regression with cluster as a fixed effect performs better. However, it can only be applied when the cluster size is large enough to provide a stable estimate of the cluster effect.
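Both the within-cluster logistic regression method and the cluster-as-fixed-effect method depend on each cluster still having observed outcomes of both kinds. The following Python sketch shows one way such clusters could be flagged before imputing. It is an illustration only, not the authors' code: the function name `clusters_with_one_class` and the `(cluster_id, outcome)` record layout are assumptions made for this example.

```python
from collections import defaultdict


def clusters_with_one_class(records):
    """Return the ids of clusters whose observed binary outcomes are all 0,
    all 1, or entirely missing.

    `records` is an iterable of (cluster_id, outcome) pairs, where outcome
    is 0, 1, or None for a missing value (hypothetical layout).
    """
    observed = defaultdict(set)
    for cluster_id, outcome in records:
        observed[cluster_id]  # touch the key so all-missing clusters are listed too
        if outcome is not None:
            observed[cluster_id].add(outcome)
    # A logistic model needs both outcome classes among the observed values.
    return sorted(c for c, classes in observed.items() if len(classes) < 2)


# Made-up example: cluster B has only 1s observed, cluster C has nothing observed.
records = [
    ("A", 1), ("A", 0), ("A", None),
    ("B", 1), ("B", None), ("B", None),
    ("C", None), ("C", None),
]
problem_clusters = clusters_with_one_class(records)
print(problem_clusters)
```

Clusters returned by this check are the ones for which a within-cluster logistic imputation model cannot be estimated, which matches the failure described above for high percentages of missingness.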
For multiple imputation, the overall variance of the estimated treatment effect consists of two parts: the within-imputation variance $U$ and the between-imputation variance $B$. The total variance $T$ is calculated as $T = U + (1 + 1/m)B$, where $m$ is the number of imputed datasets [10]. Since standard MI strategies ignore the between-cluster variance and fail to account for the intra-cluster correlation, the within-imputation variance may be underestimated, which could lead to underestimation of the total variance and consequently to a narrower confidence interval. In addition, the adequacy of standard MI strategies depends on the ICC. In our study, the ICC of the CHAT dataset is 0.055 and the cluster effect in the random-effects model is statistically significant.

Of the three imputation methods, namely the predictive model (logistic regression) method, the propensity score method, and the MCMC method, the last is the most popular method for multiple imputation of missing data and is the default method implemented in SAS. Although this method is widely used to impute binary and polytomous data, there are concerns about the consequences of violating the normality assumption. Experience has repeatedly shown that multiple imputation using the MCMC method tends to be quite robust even when the real data depart from the multivariate normal distribution [20]. Therefore, when handling missing binary or ordered categorical variables, it is acceptable to impute under a normality assumption and then round the continuous imputed values to the nearest category. For example, the imputed values for a missing binary variable can be any real value rather than being restricted to 0 and 1. We rounded the imputed values so that values greater than or equal to 0.5 were set to 1 and values less than 0.5 were set to 0 [34]. Horton et al. [35] showed that such rounding may produce biased estimates of proportions when the true proportion is near 0 or 1, but does well under most other conditions.
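The variance formula above can be illustrated with a short Python sketch of the standard pooling step. The function name `pool_rubin` and the five estimates are made up for illustration, and the estimates are assumed to be on the log odds ratio scale; none of this is taken from the CHAT analysis.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors.

    Returns (pooled estimate, U, B, T) with T = U + (1 + 1/m) * B.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching inputs")
    q_bar = mean(estimates)   # pooled point estimate
    u = mean(variances)       # within-imputation variance U
    b = variance(estimates)   # between-imputation variance B (divisor m - 1)
    t = u + (1 + 1 / m) * b   # total variance T
    return q_bar, u, b, t


# Hypothetical log odds ratios and squared standard errors from m = 5 imputations.
log_or = [0.10, 0.14, 0.12, 0.18, 0.16]
se_squared = [0.040, 0.042, 0.041, 0.039, 0.043]

q_bar, u, b, t = pool_rubin(log_or, se_squared)
print("pooled OR:", math.exp(q_bar))
print("U:", u, "B:", b, "T:", t)
```

If an imputation strategy ignores clustering, the per-imputation variances fed into `u` are too small, so `t` is too small as well, which is the underestimation described above.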
The propensity score method was originally designed to impute missing values of the response variables in randomized experiments with repeated measures [21]. Since it uses only the covariate information associated with the missingness and ignores the correlation among variables, it may produce badly biased estimates of regression coefficients when data on predictor variables are missing. In addition, with small sample sizes and a relatively large number of propensity score groups, application of the ABB method is problematic, especially for binary variables. In this case, a modified version of ABB should be used [36].

Some limitations of the present study need to be acknowledged. First, the simulation study is based on a real dataset, which has a relatively large cluster size and a small ICC. Further research should investigate the performance of the different imputation strategies under different design settings. Second, the scenario in which an entire cluster is missing is not investigated in this paper. The proposed within-cluster and across-cluster MI strategies may not apply to this scenario. Third, we investigated the performance of the different MI strategies assuming MCAR and CD missingness mechanisms. The results therefore cannot be generalized to MAR or MNAR scenarios. Fourth, since the estimated treatment effects are similar under the different imputation strategies, we presented only the OR and 95% CI for each simulation scenario. However, estimates of standardized bias and coverage would be more informative and would also provide a quantitative guideline for assessing the adequacy of the imputations [37].

6. Conclusions

When the percentage of missing data is low or the intra-cluster correlation coefficient is small, different imputation strategies or the complete case analysis approach generate quite similar results.
When the percentage of missing data is high, standard MI strategies, which do not take the intra-cluster correlation into account, underestimate the variance of the treatment effect. Within-cluster and across-cluster MI strategies (except for the random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, seem to be more appropriate for handling missing outcomes from CRTs. Under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar.

Appendix A: SAS code for across-cluster random-effects logistic regression method

    %let maximum = 1000;
    ods listing close;
    proc nlmixed data=mcar&percent&index cov;
      parms b0=-0.0645 bgroup=-0.1433 bdiabbase=-0.04 bhdbase=0.1224 bage=-0.0066
            bbasebpcontrolled=1.1487 bsex=0.0873 s2u=0.5;

Population Health Research Institute, Hamilton Health Sciences

References

[1] Campbell MK, Grimshaw JM: Cluster randomised trials: time for improvement. The implications of adopting a cluster design are still largely being ignored. BMJ. 1998, 317 (7167): 1171-1172.
[2] COMMIT Research Group: Community Intervention Trial for Smoking Cessation (COMMIT): 1. Cohort results from a four-year community intervention. Am J Public Health. 1995, 85: 183-192. 10.2105/AJPH.85.2.183.
[3] Donner A, Klar N: Design and Analysis of Cluster Randomisation Trials in Health Research. 2000, London: Arnold.
[4] Cornfield J: Randomization by group: a formal analysis. Am J Epidemiol. 1978, 108 (2): 100-102.
[5] Donner A, Brown KS, Brasher P: A methodological review of non-therapeutic intervention trials employing cluster randomization, 1979-1989. Int J Epidemiol. 1990, 19 (4): 795-800. 10.1093/ije/19.4.795.
[6] Rubin DB: Inference and missing data. Biometrika. 1976, 63: 581-592. 10.1093/biomet/63.3.581.
[7] Allison PD: Missing Data. 2001, SAGE Publications Inc.
[8] Schafer JL, Olsen MK: Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioral Research. 1998, 33: 545-571. 10.1207/s15327906mbr3304_5.
[9] McArdle JJ: Structural factor analysis experiments with incomplete data. Multivariate Behavioral Research. 1994, 29: 409-454. 10.1207/s15327906mbr2904_5.
[10] Little RJA, Rubin DB: Statistical Analysis with Missing Data. 2002, New York: John Wiley, Second edition.
[11] Rubin DB: Multiple Imputation for Nonresponse in Surveys. 1987, New York, NY: John Wiley & Sons, Inc.
[12] Yi GYY, Cook RJ: Marginal Methods for Incomplete Longitudinal Data Arising in Clusters. Journal of the American Statistical Association. 2002, 97 (460): 1071-1080. 10.1198/016214502388618889.
[13] Hunsberger S, Murray D, Davis CE, Fabsitz RR: Imputation strategies for missing data in a school-based multi-centre study: the Pathways study. Stat Med. 2001, 20 (2): 305-316. 10.1002/1097-0258(20010130)20:2<305::AID-SIM645>3.0.CO;2-M.
[14] Nixon RM, Duffy SW, Fender GR: Imputation of a true endpoint from a surrogate: application to a cluster randomized controlled trial with partial information on the true endpoint. BMC Med Res Methodol. 2003, 3: 17. 10.1186/1471-2288-3-17.
[15] Green SB, Corle DK, Gail MH, Mark SD, Pee D, Freedman LS, Graubard BI, Lynn WR: Interplay between design and analysis for behavioral intervention trials with community as the unit of randomization. Am J Epidemiol. 1995, 142 (6): 587-593.
[16] Green SB: The advantages of community-randomized trials for evaluating lifestyle modification. Control Clin Trials. 1997, 18 (6): 506-513; discussion 514-516. 10.1016/S0197-2456(97)00013-5.
[17] Taljaard M, Donner A, Klar N: Imputation strategies for missing continuous outcomes in cluster randomized trials. Biom J. 2008, 50 (3): 329-345. 10.1002/bimj.200710423.
[18] Kenward MG, Carpenter J: Multiple imputation: current perspectives. Stat Methods Med Res. 2007, 16 (3): 199-218. 10.1177/0962280206075304.
[19] Dobson AJ: An Introduction to Generalized Linear Models. 2002, Boca Raton: Chapman & Hall/CRC, 2nd edition.
[20] Schafer JL: Analysis of Incomplete Multivariate Data. 1997, London: Chapman and Hall.
[21] SAS Publishing: SAS/STAT 9.1 User's Guide: support. sasdocumentationonlinedoc91pdfsasdoc91statug7313.pdf
[22] Rubin DB, Schenker N: Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association. 1986, 81 (394): 366-374. 10.2307/2289225.
[23] Ma J, Thabane L, Kaczorowski J, Chambers L, Dolovich L, Karwalajtys T, Levitt C: Comparison of Bayesian and classical methods in the analysis of cluster randomized controlled trials with a binary outcome: the Community Hypertension Assessment Trial (CHAT). BMC Med Res Methodol. 2009, 9: 37. 10.1186/1471-2288-9-37.
[24] Levin KA: Study design VII. Randomised controlled trials. Evid Based Dent. 2007, 8 (1): 22-23. 10.1038/sj.ebd.6400473.
[25] Matthews FE, Chatfield M, Freeman C, McCracken C, Brayne C, MRC CFAS: Attrition and bias in the MRC cognitive function and ageing study: an epidemiological investigation. BMC Public Health. 2004, 4: 12. 10.1186/1471-2458-4-12.
[26] Ostbye T, Steenhuis R, Wolfson C, Walton R, Hill G: Predictors of five-year mortality in older Canadians: the Canadian Study of Health and Aging. J Am Geriatr Soc. 1999, 47 (10): 1249-1254.
[27] Viera AJ, Garrett JM: Understanding interobserver agreement: the kappa statistic. Fam Med. 2005, 37 (5): 360-363.
[28] Laurenceau JP, Stanley SM, Olmos-Gallo A, Baucom B, Markman HJ: Community-based prevention of marital dysfunction: multilevel modeling of a randomized effectiveness study. J Consult Clin Psychol. 2004, 72 (6): 933-943. 10.1037/0022-006X.72.6.933.
[29] Shrive FM, Stuart H, Quan H, Ghali WA: Dealing with missing data in a multi-question depression scale: a comparison of imputation methods. BMC Med Res Methodol. 2006, 6: 57. 10.1186/1471-2288-6-57.
[30] Elobeid MA, Padilla MA, McVie T, Thomas O, Brock DW, Musser B, Lu K, Coffey CS, Desmond RA, St-Onge MP, Gadde KM, Heymsfield SB, Allison DB: Missing data in randomized clinical trials for weight loss: scope of the problem, state of the field, and performance of statistical methods. PLoS One. 2009, 4 (8): e6624. 10.1371/journal.pone.0006624.
[31] McCulloch CE, Neuhaus JM: Prediction of Random Effects in Linear and Generalized Linear Models under Model Misspecification. Biometrics.
[32] Neuhaus JM, McCulloch CE: Separating between- and within-cluster covariate effects using conditional and partitioning methods. Journal of the Royal Statistical Society, Series B. 2006, 68: 859-872.
[33] Heagerty PJ, Kurland BF: Misspecified maximum likelihood estimates and generalised linear mixed models. Biometrika. 2001, 88 (4): 973-985. 10.1093/biomet/88.4.973.
[34] Christopher FA: Rounding after multiple imputation with non-binary categorical covariates. SAS Focus Session SUGI. 2004, 30.
[35] Horton NJ, Lipsitz SR, Parzen M: A potential for bias when rounding in multiple imputation. American Statistician. 2003, 57: 229-232. 10.1198/0003130032314.
[36] Li X, Mehrotra DV, Barnard J: Analysis of incomplete longitudinal binary data using multiple imputation. Stat Med. 2006, 25 (12): 2107-2124. 10.1002/sim.2343.
[37] Collins LM, Schafer JL, Kam CM: A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Methods. 2001, 6 (4): 330-351. 10.1037/1082-989X.6.4.330.

Ma et al; licensee BioMed Central Ltd. 2011. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Multiple Imputation in Stata: Imputing

This is part four of the Multiple Imputation in Stata series. For a list of topics covered by this series, see the Introduction. This section will talk you through the details of the imputation process. Be sure you've read at least the previous section, Creating Imputation Models, so you have a sense of what issues can affect the validity of your results.

Example Data

To illustrate the process, we'll use a fabricated data set. Unlike those in the examples section, this data set is designed to have some resemblance to real world data. It contains the following variables:

female (binary)
race (categorical, three values)
urban (binary)
edu (ordered categorical, four values)
exp (continuous)
wage (continuous)

Missingness: Each value of all the variables except female has a 10% chance of being missing completely at random, but of course in the real world we won't know that it is MCAR ahead of time.
Thus we will check whether it is MCAR or MAR (MNAR cannot be checked by looking at the observed data) using the procedure outlined in Deciding to Impute:

    unab numvars: *
    unab missvars: urban-wage
    misstable sum, gen(miss_)
    foreach var of local missvars {
        local covars: list numvars - var
        display _newline(3) "logit missingness of `var' on `covars'"
        logit miss_`var' `covars'
        foreach nvar of local covars {
            display _newline(3) "ttest of `nvar' by missingness of `var'"
            ttest `nvar', by(miss_`var')
        }
    }

See the log file for results.

Our goal is to regress wages on sex, race, education level, and experience. To see the "right" answers, open the do file that creates the data set and examine the gen command that defines wage.

Complete code for the imputation process can be found in the accompanying do file. The imputation process creates a lot of output. We'll put highlights in this page; a complete log file, including the associated graphs, is also available. Each section of this article refers to the relevant section of the log.

Setting up

The first step in using mi commands is to mi set your data. This is somewhat similar to svyset, tsset, or xtset. The mi set command tells Stata how it should store the additional imputations you'll create. We suggest using the wide format, as it is slightly faster. On the other hand, mlong uses slightly less memory.

To have Stata use the wide data structure, type:

    mi set wide

To have Stata use the mlong (marginal long) data structure, type:

    mi set mlong

The wide vs. long terminology is borrowed from reshape and the structures are similar. However, they are not equivalent and you would never use reshape to change the data structure used by mi. Instead, type mi convert wide or mi convert mlong (add , clear if the data have not been saved since the last change).
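To make the difference between the two storage layouts concrete, here is a small Python sketch. It only mimics the idea (wide keeps one row per observation and adds a column per imputation; mlong adds rows, and only for incomplete observations); it is not Stata's internal format. The function names and the `_1_wage`-style column names are assumptions for this illustration.

```python
def to_wide(rows, var, imputations):
    """One row per observation; imputation m is stored in an extra column."""
    wide = []
    for row in rows:
        new_row = dict(row)
        for m, imputed in enumerate(imputations, start=1):
            value = row[var] if row[var] is not None else imputed[row["id"]]
            new_row[f"_{m}_{var}"] = value
        wide.append(new_row)
    return wide


def to_mlong(rows, var, imputations):
    """Original rows (m = 0) plus, per imputation, copies of incomplete rows only."""
    mlong = [dict(row, m=0) for row in rows]
    for m, imputed in enumerate(imputations, start=1):
        for row in rows:
            if row[var] is None:
                filled = dict(row, m=m)
                filled[var] = imputed[row["id"]]
                mlong.append(filled)
    return mlong


# Three observations, one missing wage, two imputations (made-up numbers).
rows = [{"id": 1, "wage": 10.0}, {"id": 2, "wage": None}, {"id": 3, "wage": 12.5}]
imputations = [{2: 11.0}, {2: 9.5}]

wide = to_wide(rows, "wage", imputations)
mlong = to_mlong(rows, "wage", imputations)
print(len(wide), "rows in wide;", len(mlong), "rows in mlong")
```

In this toy layout, wide repeats every value once per imputation, while mlong repeats only the incomplete observation, which is why a marginal long layout can need less memory when few observations have missing values.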
Most of the time you don't need to worry about how the imputations are stored: the mi commands figure out automatically how to apply whatever you do to each imputation. But if you need to manipulate the data in a way mi can't do for you, then you'll need to learn about the details of the structure you're using. You'll also need to be very, very careful. If you're interested in such things (including the rarely used flong and flongsep formats), run the accompanying do file and read the comments it contains while examining the data browser to see what the data look like in each form.

Registering Variables

The mi commands recognize three kinds of variables:

Imputed variables are variables that mi is to impute or has imputed.

Regular variables are variables that mi is not to impute, either by choice or because they are not missing any values.

Passive variables are variables that are completely determined by other variables. For example, log wage is determined by wage, or an indicator for obesity might be determined by a function of weight and height. Interaction terms are also passive variables, though if you use Stata's interaction syntax you won't have to declare them as such. Passive variables are often problematic: the examples on transformations, non-linearity, and interactions show how using them inappropriately can lead to biased estimates.

If a passive variable is determined by regular variables, then it can be treated as a regular variable since no imputation is needed. Passive variables only have to be treated as such if they depend on imputed variables.

Registering a variable tells Stata what kind of variable it is. Imputed variables must always be registered:

    mi register imputed varlist

where varlist should be replaced by the actual list of variables to be imputed.
Regular variables often don't have to be registered, but it's a good idea:

    mi register regular varlist

Passive variables must be registered:

    mi register passive varlist

However, passive variables are more often created after imputing. Do so with mi passive and they'll be registered as passive automatically.

In our example data, all the variables except female need to be imputed. The appropriate mi register command is:

    mi register imputed race-wage

(Note that you cannot use _all as your varlist even if you have to impute all your variables, because that would include the system variables added by mi set to keep track of the imputation structure.)

Registering female as regular is optional, but a good idea:

    mi register regular female

Checking the Imputation Model

Based on the types of the variables, the obvious imputation methods are:

race (categorical, three values): mlogit
urban (binary): logit
edu (ordered categorical, four values): ologit
exp (continuous): regress
wage (continuous): regress

female does not need to be imputed, but should be included in the imputation models both because it is in the analysis model and because it's likely to be relevant.

Before proceeding to impute we will check each of the imputation models. Always run each of your imputation models individually, outside the mi impute chained context, to see if they converge and (insofar as it is possible) verify that they are specified correctly. Code to run each of these models is:

    mlogit race i.urban exp wage i.edu i.female
    logit urban i.race exp wage i.edu i.female
    ologit edu i.urban i.race exp wage i.female
    regress exp i.urban i.race wage i.edu i.female
    regress wage i.urban i.race exp i.edu i.female

Note that when categorical variables (ordered or not) appear as covariates, i. expands them into sets of indicator variables.

As we'll see later, the output of the mi impute chained command includes the commands for the individual models it runs.
Thus a useful shortcut, especially if you have a lot of variables to impute, is to set up your mi impute chained command with the dryrun option to prevent it from doing any actual imputing, run it, and then copy the commands from the output into your do file for testing.

Convergence Problems

The first thing to note is that all of these models run successfully. Complex models like mlogit may fail to converge if you have large numbers of categorical variables, because that often leads to small cell sizes. To pin down the cause of the problem, remove most of the variables, make sure the model works with what's left, and then add variables back one at a time or in small groups until it stops working. With some experimentation you should be able to identify the problem variable or combination of variables. At that point you'll have to decide if you can combine categories or drop variables or make other changes in order to create a workable model.

Perfect Prediction

Perfect prediction is another problem to note. The imputation process cannot simply drop the perfectly predicted observations the way logit can. You could drop them before imputing, but that seems to defeat the purpose of multiple imputation. The alternative is to add the augment (or just aug) option to the affected methods. This tells mi impute chained to use the "augmented regression" approach, which adds fake observations with very low weights in such a way that they have a negligible effect on the results but prevent perfect prediction. For details see the section "The issue of perfect prediction during imputation of categorical data" in the Stata MI documentation.

Checking for Misspecification

You should also try to evaluate whether the models are specified correctly. A full discussion of how to determine whether a regression model is specified correctly or not is well beyond the scope of this article, but use whatever tools you find appropriate. Here are some examples:

Residual vs. Fitted Value Plots

For continuous variables, residual vs. fitted value plots (easily done with rvfplot) can be useful, and several of the examples use them to detect problems. Consider the plot for experience:

    regress exp i.urban i.race wage i.edu i.female
    rvfplot

Note how a number of points are clustered along a line in the lower left, and no points are below it. This reflects the constraint that experience cannot be less than zero: the fitted value plus the residual must be at least zero, so the residuals must be greater than or equal to the negative of the fitted values. (If the graph had the same scale on both axes, the constraint line would be a 45 degree line.) If all the points were below a similar line rather than above it, this would tell you that there was an upper bound on the variable rather than a lower bound. The y-intercept of the constraint line tells you the limit in either case. You can also have both a lower bound and an upper bound, putting all the points in a band between them.

The "obvious" model, regress, is inappropriate for experience because it won't apply this constraint. It's also inappropriate for wages for the same reason. Alternatives include truncreg, ll(0) and pmm (we'll use pmm).

Adding Interactions

In this example, it seems plausible that the relationships between variables may vary between race, gender, and urban/rural groups. Thus one way to check for misspecification is to add interaction terms to the models and see whether they turn out to be important. For example, we'll compare the obvious model:

    regress exp i.race wage i.edu i.urban i.female

with one that includes interactions:

    regress exp (i.race i.urban i.female)##(c.wage i.edu)

We'll run similar comparisons for the models of the other variables. This creates a great deal of output, so see the log file for results. Interactions between female and other variables are significant in the models for exp, wage, edu, and urban. There are a few significant interactions between race or urban and other variables, but not nearly as many (and keep in mind that with this many coefficients we'd expect some false positives using a significance level of .05). We'll thus impute the men and women separately. This is an especially good option for this data set because female is never missing. If it were, we'd have to drop those observations which are missing female because they could not be placed in one group or the other.

In the imputation command this means adding the by(female) option. When testing models, it means starting the commands with the by female: prefix (and removing female from the lists of covariates). The improved imputation models are thus:

    bysort female: reg exp i.urban i.race wage i.edu
    by female: logit urban exp i.race wage i.edu
    by female: mlogit race exp i.urban wage i.edu
    by female: reg wage exp i.urban i.race i.edu
    by female: ologit edu exp i.urban i.race wage

pmm itself cannot be run outside the imputation context, but since it's based on regression you can use regular regression to test it.

These models should be tested again, but we'll omit that process.

The basic syntax for mi impute chained is:

    mi impute chained (method1) varlist1 (method2) varlist2... = regvars

Each method specifies the method to be used for imputing the varlist that follows it. The possibilities for method are regress, pmm, truncreg, intreg, logit, ologit, mlogit, poisson, and nbreg. regvars is a list of regular variables to be used as covariates in the imputation models but not imputed (there may not be any).

The basic options are:

    add(N) rseed(R) savetrace(tracefile, replace)

N is the number of imputations to be added to the data set. R is the seed to be used for the random number generator; if you do not set this, you'll get slightly different imputations each time the command is run.
The tracefile is a dataset in which mi impute chained will store information about the imputation process. We'll use this dataset to check for convergence.

Options that are relevant to a particular method go with the method, inside the parentheses but following a comma (e.g. (mlogit, aug)). Options that are relevant to the imputation process as a whole (like by(female)) go at the end, after the comma.

For our example, the command would be:

    mi impute chained (logit) urban (mlogit) race (ologit) edu (pmm) exp wage, add(5) rseed(4409) by(female)

Note that this does not include a savetrace() option. As of this writing, by() and savetrace() cannot be used at the same time, presumably because it would require one trace file for each by group. Stata is aware of this problem and we hope this will be changed soon. For purposes of this article, we'll remove the by() option when it comes time to illustrate use of the trace file. If this problem comes up in your research, talk to us about work-arounds.

Choosing the Number of Imputations

There is some disagreement among authorities about how many imputations are sufficient. Some say 3-10 in almost all circumstances, the Stata documentation suggests at least 20, while White, Royston, and Wood argue that the number of imputations should be roughly equal to the percentage of cases with missing values. However, we are not aware of any argument that increasing the number of imputations ever causes problems (just that the marginal benefit of another imputation asymptotically approaches zero).

Increasing the number of imputations in your analysis takes essentially no work on your part. Just change the number in the add() option to something bigger. On the other hand, it can be a lot of work for the computer: multiple imputation has introduced many researchers into the world of jobs that take hours or days to run. You can generally assume that the amount of time required will be proportional to the number of imputations used (e.g. if a do file takes two hours to run with five imputations, it will probably take about four hours to run with ten imputations). So here's our suggestion:

1. Start with five imputations (the low end of what's broadly considered legitimate).
2. Work on your research project until you're reasonably confident you have the analysis in its final form. Be sure to do everything with do files so you can run it again at will.
3. Note how long the process takes, from imputation to final analysis.
4. Consider how much time you have available and decide how many imputations you can afford to run, using the rule of thumb that time required is proportional to the number of imputations. If possible, make the number of imputations roughly equal to the percentage of cases with missing data (a high end estimate of what's required). Allow time to recover if things go wrong, as they generally do.
5. Increase the number of imputations in your do file and start it.
6. Do something else while the do file runs, like write your paper. Adding imputations shouldn't change your results significantly, and in the unlikely event that they do, consider yourself lucky to have found that out before publishing.

Speeding up the Imputation Process

Multiple imputation has introduced many researchers into the world of jobs that take hours, days, or even weeks to run. Usually it's not worth spending your time to make Stata code run faster, but multiple imputation can be an exception.

Use the fastest computer available to you. For SSCC members that means learning to run jobs on Linstat, the SSCC's Linux computing cluster. Linux is not as difficult as you may think; Using Linstat has instructions.

Multiple imputation involves more reading and writing to disk than most Stata commands. Sometimes this includes writing temporary files in the current working directory. Use the fastest disk space available to you, both for your data set and for the working directory.
In general local disk space will be faster than network disk space, and on Linstat ramdisk (a "directory" that is actually stored in RAM) will be faster than local disk space. On the other hand, you would not want to permanently store data sets anywhere but network disk space. So consider having your do file copy the data set to the fastest available disk space, work on it there, and save the results back to network disk space.

This applies when you're using imputed data as well. If your data set is large enough that working with it after imputation is slow, the same procedure may help.

Checking for Convergence

MICE is an iterative process. In each iteration, mi impute chained first estimates the imputation model, using both the observed data and the imputed data from the previous iteration. It then draws new imputed values from the resulting distributions. Note that as a result, each iteration has some autocorrelation with the previous iteration.

The first iteration must be a special case: in it, mi impute chained first estimates the imputation model for the variable with the fewest missing values based only on the observed data and draws imputed values for that variable. It then estimates the model for the variable with the next fewest missing values, using both the observed values and the imputed values of the first variable, and proceeds similarly for the rest of the variables. Thus the first iteration is often atypical, and because iterations are correlated it can make subsequent iterations atypical as well.

To avoid this, mi impute chained by default goes through ten iterations for each imputed data set you request, saving only the results of the tenth iteration. The first nine iterations are called the burn-in period. Normally this is plenty of time for the effects of the first iteration to become insignificant and for the process to converge to a stationary state. However, you should check for convergence and, if necessary, increase the number of iterations using the burnin() option.
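The iteration scheme described above can be sketched in Python for the simplest possible case. This is a toy illustration under several assumptions that are not part of Stata's implementation: there are exactly two continuous variables, each is imputed from the other by simple linear regression plus random noise, starting values are drawn from the observed values (a simplification of the special first iteration), and no uncertainty in the regression coefficients is drawn. The names `fit_line` and `chained_imputation` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (n - 2)) ** 0.5
    return intercept, slope, sd


def chained_imputation(data, iterations=10, seed=0):
    """Impute two variables in turn; keep only the state after the last iteration.

    Returns (completed data, trace), where trace[var] holds the mean of the
    imputed values of var after each iteration.
    """
    if len(data) != 2:
        raise ValueError("this sketch handles exactly two variables")
    rng = random.Random(seed)
    # Visit the variable with the fewest missing values first.
    names = sorted(data, key=lambda v: sum(x is None for x in data[v]))
    missing = {v: [i for i, x in enumerate(data[v]) if x is None] for v in names}
    observed = {v: [x for x in data[v] if x is not None] for v in names}
    # Simplified start: fill each gap with a random observed value.
    current = {v: [x if x is not None else rng.choice(observed[v]) for x in data[v]]
               for v in names}
    trace = {v: [] for v in names}
    for _ in range(iterations):
        for target in names:
            predictor = next(v for v in names if v != target)
            rows = [i for i, x in enumerate(data[target]) if x is not None]
            a, b, sd = fit_line([current[predictor][i] for i in rows],
                                [current[target][i] for i in rows])
            for i in missing[target]:
                current[target][i] = a + b * current[predictor][i] + rng.gauss(0, sd)
            if missing[target]:
                trace[target].append(
                    sum(current[target][i] for i in missing[target]) / len(missing[target]))
    return current, trace


# Made-up data: y depends on x, and each variable has some values removed.
gen = random.Random(1)
x = [gen.uniform(0, 10) for _ in range(50)]
y = [2.0 * xi + gen.gauss(0, 1) for xi in x]
data = {
    "x": [None if i % 7 == 0 else v for i, v in enumerate(x)],
    "y": [None if i % 5 == 1 else v for i, v in enumerate(y)],
}
completed, trace = chained_imputation(data, iterations=10, seed=42)
print(len(trace["x"]), "iterations recorded for x")
```

The per-iteration means in `trace` play the same role as the trace file: they should fluctuate randomly from one iteration to the next without drifting in one direction.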
To do so, examine the trace file saved by mi impute chained. It contains the mean and standard deviation of each imputed variable in each iteration. These will vary randomly, but they should not show any trend. An easy way to check is with tsline, but it requires reshaping the data first.

Our preferred imputation model uses by(), so it cannot save a trace file. Thus we'll remove by() for the moment. We'll also increase the burnin() option to 100 so it's easier to see what a stable trace looks like. We'll then use reshape and tsline to check for convergence:

    preserve
    mi impute chained (logit) urban (mlogit) race (ologit) edu (pmm) exp wage = female, add(5) rseed(88) savetrace(extrace, replace) burnin(100)
    use extrace, clear
    reshape wide *mean *sd, i(iter) j(m)
    tsset iter
    tsline exp_mean*, title("Mean of Imputed Values of Experience") note("Each line is for one imputation") legend(off)
    graph export conv1.png, replace
    tsline exp_sd*, title("Standard Deviation of Imputed Values of Experience") note("Each line is for one imputation") legend(off)
    graph export conv2.png, replace
    restore

The resulting graphs do not show any obvious problems:

If you do see signs that the process may not have converged after the default ten iterations, increase the number of iterations performed before saving imputed values with the burnin() option. If convergence is never achieved, this indicates a problem with the imputation model.

Checking the Imputed Values

After imputing, you should check to see if the imputed data resemble the observed data. Unfortunately there's no formal test to determine what's "close enough." Of course, if the data are MAR but not MCAR, the imputed data should be systematically different from the observed data. Ironically, the fewer missing values you have to impute, the more variation you'll see between the imputed data and the observed data (and between imputations).

For binary and categorical variables, compare frequency tables.
For continuous variables, comparing means and standard deviations is a good starting point, but you should look at the overall shape of the distribution as well. For that we suggest kernel density graphs or perhaps histograms. Look at each imputation separately rather than pooling all the imputed values so you can see if any one of them went wrong.

The mi xeq: prefix tells Stata to apply the subsequent command to each imputation individually. It also applies to the original data, the "zeroth imputation." Thus:

    mi xeq: tab race

will give you six frequency tables: one for the original data, and one for each of the five imputations.

However, we want to compare the observed data to just the imputed data, not the entire data set. This requires adding an if condition to the tab commands for the imputations, but not the observed data. Add a number or numlist to have mi xeq act on particular imputations:

    mi xeq 0: tab race
    mi xeq 1/5: tab race if miss_race

This creates frequency tables for the observed values of race and then the imputed values in all five imputations.

If you have a significant number of variables to examine you can easily loop over them:

    foreach var of varlist urban race edu {
        mi xeq 0: tab `var'
        mi xeq 1/5: tab `var' if miss_`var'
    }

For results see the log file.

Running summary statistics on continuous variables follows the same process, but creating kernel density graphs adds a complication: you need to either save the graphs or give yourself a chance to look at them. mi xeq: can carry out multiple commands for each imputation: just place them all in one line with a semicolon (;) at the end of each. (This will not work if you've changed the general end-of-command delimiter to a semicolon.) The sleep command tells Stata to pause for a specified period, measured in milliseconds.
    mi xeq 0: kdensity wage; sleep 1000
    mi xeq 1/5: kdensity wage if miss_wage; sleep 1000

Again, this can all be automated:

    foreach var of varlist wage exp {
        mi xeq 0: sum `var'
        mi xeq 1/5: sum `var' if miss_`var'
        mi xeq 0: kdensity `var'; sleep 1000
        mi xeq 1/5: kdensity `var' if miss_`var'; sleep 1000
    }

Saving the graphs turns out to be a bit trickier, because you need to give the graph from each imputation a different file name. Unfortunately you cannot access the imputation number within mi xeq. However, you can do a forvalues loop over imputation numbers, then have mi xeq act on each of them:

    forval i=1/5 {
        mi xeq `i': kdensity exp if miss_exp; graph export exp`i'.png, replace
    }

Integrating this with the previous version gives:

    foreach var of varlist wage exp {
        mi xeq 0: sum `var'
        mi xeq 1/5: sum `var' if miss_`var'
        mi xeq 0: kdensity `var'; graph export chk`var'0.png, replace
        forval i=1/5 {
            mi xeq `i': kdensity `var' if miss_`var'; graph export chk`var'`i'.png, replace
        }
    }

For results, see the log file.

It's troublesome that in all imputations the mean of the imputed values of wage is higher than the mean of the observed values of wage, and the mean of the imputed values of exp is lower than the mean of the observed values of exp. We did not find evidence that the data are MAR but not MCAR, so we'd expect the means of the imputed data to be clustered around the means of the observed data. There is no formal test to tell us definitively whether this is a problem or not. However, it should raise suspicions, and if the final results with these imputed data differ from the results of complete case analysis, it raises the question of whether the difference is due to problems with the imputation model.

Last Revised: 8/23/2012
Statistical Computing Seminars: Missing Data in SAS Part 1

Note: A PowerPoint presentation of this webpage can be downloaded here.

Introduction

Missing data is a common issue, and more often than not, we deal with the matter of missing data in an ad hoc fashion. The purpose of this seminar is to discuss commonly used techniques for handling missing data and common issues that could arise when these techniques are used. In particular, we will focus on one of the most popular methods, multiple imputation. We are not advocating in favor of any one technique to handle missing data, and depending on the type of data and model you will be using, other techniques such as direct maximum likelihood may better serve your needs. We have chosen to explore multiple imputation through an examination of the data, a careful consideration of the assumptions needed to implement this method and a clear understanding of the analytic model to be estimated. We hope this seminar will help you to better understand the scope of the issues you might face when dealing with missing data using this method.

The data set hsbmar.sas7bdat, which is based on hsb2.sas7bdat and is used for this seminar, can be downloaded by following the link. The SAS code for this seminar was developed using SAS 9.4 and SAS/STAT 13.1. Some of the variables have value labels (formats) associated with them. Here is the setup for reading the value labels correctly.

Goals of statistical analysis with missing data:

1. Minimize bias
2. Maximize use of available information
3. Obtain appropriate estimates of uncertainty

Exploring missing data mechanisms

The missing data mechanism describes the process that is believed to have generated the missing values.
Missing data mechanisms generally fall into one of three main categories. There are precise technical definitions for these terms in the literature; the following explanation necessarily contains simplifications.

Missing completely at random (MCAR)

A variable is missing completely at random if neither the variables in the dataset nor the unobserved value of the variable itself predict whether a value will be missing. Missing completely at random is a fairly strong assumption and may be relatively rare. One relatively common situation in which data are missing completely at random occurs when a subset of cases is randomly selected to undergo additional measurement; this is sometimes referred to as "planned missing." For example, in some health surveys, some subjects are randomly selected to undergo a more extensive physical examination, so only a subset of participants will have complete information for these variables. Missing completely at random also allows for missingness on one variable to be related to missingness on another, e.g. var1 is missing whenever var2 is missing. For example, a husband and wife are both missing information on height.

Missing at random (MAR)

A variable is said to be missing at random if other variables in the dataset (but not the variable itself) can be used to predict missingness on a given variable. For example, in surveys, men may be more likely to decline to answer some questions than women (i.e., gender predicts missingness on another variable). MAR is a less restrictive assumption than MCAR. Under this assumption the probability of missingness does not depend on the true values after controlling for the observed variables. MAR is also related to ignorability. The missing data mechanism is said to be ignorable if it is missing at random and the probability of missingness does not depend on the missing information itself.
The assumption of ignorability is needed for optimal estimation of missing information and is a required assumption for both of the missing data techniques we will discuss.

Missing not at random (MNAR)

Finally, data are said to be missing not at random if the value of the unobserved variable itself predicts missingness. A classic example of this is income. Individuals with very high incomes are more likely to decline to answer questions about their income than individuals with more moderate incomes.

An understanding of the missing data mechanism(s) present in your data is important because different types of missing data require different treatments. When data are missing completely at random, analyzing only the complete cases will not result in biased parameter estimates (e.g. regression coefficients). However, the sample size for an analysis can be substantially reduced, leading to larger standard errors. In contrast, analyzing only complete cases for data that are either missing at random or missing not at random can lead to biased parameter estimates. Multiple imputation and other modern methods such as direct maximum likelihood generally assume that the data are at least MAR, meaning that these procedures can also be used on data that are missing completely at random. Statistical models have also been developed for modeling MNAR processes; however, these models are beyond the scope of this seminar.

For more information on missing data mechanisms please see: Allison, 2002; Enders, 2010; Little & Rubin, 2002; Rubin, 1976; Schafer & Graham, 2002.

Full data: Below is a regression model predicting read using the complete data set (hsb2) used to create hsbmar. We will use these results for comparison.

Common techniques for dealing with missing data

In this section, we are going to discuss some common techniques for dealing with missing data and briefly discuss their limitations.
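Before turning to those techniques, the three mechanisms described above can be illustrated with a small simulation. The Python sketch below is only an illustration on invented data (the variable names, the score scale and the missingness probabilities are all assumptions, not values from hsbmar). It deletes values of y under an MCAR, a MAR and an MNAR rule and compares the complete-case mean with the mean of the full data.

```python
import random
import statistics

rng = random.Random(2002)
n = 20000
male = [rng.random() < 0.5 for _ in range(n)]
# Hypothetical outcome: men score 10 points higher on average.
y = [50 + 10 * is_male + rng.gauss(0, 10) for is_male in male]


def observed_mean(prob_missing):
    """Mean of y over the cases that survive deletion.

    prob_missing maps (is_male, y_value) to the probability that the
    y value is missing for that case.
    """
    kept = [yi for mi, yi in zip(male, y)
            if rng.random() >= prob_missing(mi, yi)]
    return statistics.fmean(kept)


true_mean = statistics.fmean(y)
# MCAR: every value has the same 30% chance of being missing.
mcar_mean = observed_mean(lambda is_male, value: 0.3)
# MAR: missingness depends only on the observed covariate (gender).
mar_mean = observed_mean(lambda is_male, value: 0.5 if is_male else 0.1)
# MNAR: missingness depends on the unobserved value itself.
mnar_mean = observed_mean(lambda is_male, value: 0.6 if value > 60 else 0.1)

print(f"full data            {true_mean:.2f}")
print(f"complete cases, MCAR {mcar_mean:.2f}")
print(f"complete cases, MAR  {mar_mean:.2f}")
print(f"complete cases, MNAR {mnar_mean:.2f}")
```

In this setup the MCAR complete-case mean stays close to the full-data mean, while the MAR and MNAR complete-case means are pulled downward because high values are removed more often. The difference between the last two is that under MAR the bias can be corrected using gender, which is observed, whereas under MNAR nothing in the observed data identifies it.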
1. Complete case analysis (listwise deletion)
2. Available case analysis (pairwise deletion)
3. Mean Imputation
4. Single Imputation
5. Stochastic Imputation

1. Complete Case Analysis: This method involves deleting cases in a particular dataset that are missing data on any variable of interest. It is a common technique because it is easy to implement and works with any type of analysis.

Below we look at some of the descriptive statistics of the data set hsbmar, which contains test scores, as well as demographic and school information, for 200 high school students.

Note that although the dataset contains 200 cases, six of the variables have fewer than 200 observations. The missing information varies between 4.5% (read) and 9% (female and prog) of cases depending on the variable. This doesn't seem like a lot of missing data, so we might be inclined to try to analyze the observed data as they are, a strategy sometimes referred to as complete case analysis.

Below is a regression model where the dependent variable read is regressed on write, math, female and prog. Notice that the default behavior of proc glm is complete case analysis (also referred to as listwise deletion).

Looking at the output, we see that only 130 cases were used in the analysis; in other words, more than one third of the cases in our dataset (70/200) were excluded from the analysis because of missing data. The reduction in sample size (and statistical power) alone might be considered a problem, but complete case analysis can also lead to biased estimates. Specifically, you will see below that the estimates for the intercept, write, math and prog are different from the regression model on the complete data. Also, the standard errors are all larger due to the smaller sample size, resulting in the parameter estimate for female almost becoming non-significant. Unfortunately, unless the mechanism of missing data is MCAR, this method will introduce bias into the parameter estimates.

2.
Available Case Analysis: This method involves estimating means, variances and covariances based on all available non-missing cases, meaning that a covariance (or correlation) matrix is computed where each element is based on the full set of cases with non-missing values for each pair of variables. This method became popular because the loss of power due to missing information is not as substantial as with complete case analysis. Below we look at the pairwise correlations between the outcome read and each of the predictors, write, prog, female, and math. Depending on the pairwise comparison examined, the sample size will change based on the amount of missing data present in one or both variables.

Because proc glm does not accept covariance matrices as data input, the following example will be done with proc reg. This will require us to create dummy variables for our categorical predictor prog, since there is no class statement in proc reg. By default proc corr uses pairwise deletion to estimate the correlation table. The options on the proc corr statement, cov and outp=, will output a variance/covariance matrix based on pairwise deletion that will be used in the subsequent regression model.

The first thing you should see is the note that SAS prints to your log file stating "N not equal across variables in data set. This may not be appropriate. The smallest value will be used." One of the main drawbacks of this method is that there is no consistent sample size. You will also notice that the parameter estimates presented here are different from the estimates obtained from analysis on the full data and from the listwise deletion approach. For instance, the variable female had an estimated effect of -2.7 with the full data but was attenuated to -1.85 for the available case analysis. Unless the mechanism of missing data is MCAR, this method will introduce bias into the parameter estimates. Therefore, this method is not recommended.

3.
Unconditional Mean Imputation: This method involves replacing the missing values for an individual variable with its overall estimated mean from the available cases. While this is a simple and easily implemented method for dealing with missing values, it has some unfortunate consequences. The most important problem with mean imputation, also called mean substitution, is that it will result in an artificial reduction in variability because you are imputing values at the center of the variable's distribution. This also has the unintended consequence of changing the magnitude of correlations between the imputed variable and other variables. We can demonstrate this phenomenon in our data.

Below are tables of the means and standard deviations of the four variables in our regression model BEFORE and AFTER a mean imputation, as well as their corresponding correlation matrices. We will again utilize the prog dummy variables we created previously.

You will notice that there is very little change in the mean (as you would expect); however, the standard deviation is noticeably lower after substituting in mean values for the observations with missing information. This is because you reduce the variability in your variables when you impute everyone at the mean. Moreover, you can see in the table of "Pearson Correlation Coefficients" that the correlations between each of our predictors of interest (write, math, female, and prog), as well as between the predictors and the outcome read, have now been attenuated. Therefore, regression models that seek to estimate the associations between these variables will also see their effects weakened.

4. Single or Deterministic Imputation: A slightly more sophisticated type of imputation is a regression/conditional mean imputation, which replaces missing values with predicted scores from a regression equation. The strength of this approach is that it uses complete information to impute values.
The drawback here is that all your predicted values will fall directly on the regression line, once again decreasing variability, just not as much as with unconditional mean imputation. Moreover, statistical models cannot distinguish between observed and imputed values and therefore do not incorporate into the model the error or uncertainty associated with that imputed value. Additionally, you will see that this method will also inflate the associations between variables because it imputes values that are perfectly correlated with one another. Unfortunately, even under the assumption of MCAR, regression imputation will upwardly bias correlations and R-squared statistics. Further discussion and an example of this can be found in Craig Enders' book "Applied Missing Data Analysis" (2010).

5. Stochastic Imputation: In recognition of the problems with regression imputation and the reduced variability associated with this approach, researchers developed a technique to incorporate or "add back" lost variability. A residual term, randomly drawn from a normal distribution with mean zero and variance equal to the residual variance from the regression model, is added to the predicted scores from the regression imputation, thus restoring some of the lost variability. This method is superior to the previous methods as it will produce unbiased coefficient estimates under MAR. However, the standard errors produced during regression estimation, while less biased than with the single imputation approach, will still be attenuated.

While you might be inclined to use one of these more traditional methods, consider this statement: "Missing data analyses are difficult because there is no inherently correct methodological procedure.
In many (if not most) situations, blindly applying maximum likelihood estimation or multiple imputation will likely lead to a more accurate set of estimates than using one of the previously mentioned missing data handling techniques" (p. 344, Applied Missing Data Analysis, 2010).

Multiple Imputation

Multiple imputation is essentially an iterative form of stochastic imputation. However, instead of filling in a single value, the distribution of the observed data is used to estimate multiple values that reflect the uncertainty around the true value. These values are then used in the analysis of interest, such as in an OLS model, and the results are combined. Each imputed value includes a random component whose magnitude reflects the extent to which other variables in the imputation model cannot predict its true value (Johnson and Young, 2011; White et al., 2010), thus building into the imputed values a level of uncertainty around their "truthfulness."

A common misconception about missing data methods is the assumption that imputed values should represent "real" values. The purpose when addressing missing data is to correctly reproduce the variance/covariance matrix we would have observed had our data not had any missing information.

MI has three basic phases:

1. Imputation or Fill-in Phase: The missing data are filled in with estimated values and a complete data set is created. This process of fill-in is repeated m times.
2. Analysis Phase: Each of the m complete data sets is then analyzed using a statistical method of interest (e.g. linear regression).
3. Pooling Phase: The parameter estimates (e.g. coefficients and standard errors) obtained from each analyzed data set are then combined for inference.

The imputation method you choose depends on the pattern of missing information as well as the type of variable(s) with missing information.
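The three phases can be sketched end to end in a few lines of Python. This is a toy illustration on simulated data, not the SAS workflow used in this seminar: the imputation step is stochastic regression imputation, with a bootstrap of the complete cases standing in for a proper draw of the regression parameters, the analysis is simply the mean of y, and all function and variable names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5


def multiply_impute(x, y, m, rng):
    """Phase 1: return m completed copies of y (None marks missing)."""
    complete = [i for i, v in enumerate(y) if v is not None]
    completed = []
    for _ in range(m):
        # Bootstrap the complete cases so that each imputation is based on
        # different regression estimates (an assumption of this sketch).
        boot = rng.choices(complete, k=len(complete))
        a, b, sd = fit_line([x[i] for i in boot], [y[i] for i in boot])
        completed.append([
            v if v is not None else a + b * x[i] + rng.gauss(0, sd)
            for i, v in enumerate(y)
        ])
    return completed


def analyze(y):
    """Phase 2: estimate the mean of y and its sampling variance."""
    return statistics.fmean(y), statistics.variance(y) / len(y)


def pool(results):
    """Phase 3: combine (estimate, variance) pairs across imputations."""
    m = len(results)
    estimates = [est for est, _ in results]
    within = statistics.fmean([var for _, var in results])
    between = statistics.variance(estimates)
    total = within + between + between / m
    return statistics.fmean(estimates), within, between, total


rng = random.Random(2010)
x = [rng.gauss(0, 1) for _ in range(500)]
y_full = [10 + 2 * xi + rng.gauss(0, 1) for xi in x]
# MAR: y is more likely to be missing when the observed x is positive.
y = [None if rng.random() < (0.6 if xi > 0 else 0.1) else yi
     for xi, yi in zip(x, y_full)]

results = [analyze(yc) for yc in multiply_impute(x, y, m=10, rng=rng)]
pooled_mean, v_within, v_between, v_total = pool(results)
complete_case_mean = statistics.fmean([v for v in y if v is not None])

print(f"full-data mean     {statistics.fmean(y_full):.3f}")
print(f"complete-case mean {complete_case_mean:.3f}")
print(f"pooled MI mean     {pooled_mean:.3f} (SE {v_total ** 0.5:.3f})")
```

Because x is fully observed and drives both y and its missingness, the pooled estimate should land much closer to the full-data mean than the complete-case mean does, and the pooled variance is larger than the average within-imputation variance because it carries the extra uncertainty from the imputed values.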
Imputation Model, Analytic Model and Compatibility:

When developing your imputation model, it is important to assess if your imputation model is "congenial" or consistent with your analytic model. Consistency means that your imputation model includes (at the very least) the same variables that are in your analytic or estimation model. This includes any transformations to variables that will be needed to assess your hypothesis of interest. This can include log transformations, interaction terms, or recodes of a continuous variable into a categorical form, if that is how it will be used in later analysis. The reason for this relates back to the earlier comments about the purpose of multiple imputation. Since we are trying to reproduce the proper variance/covariance matrix for estimation, all relationships between our analytic variables should be represented and estimated simultaneously. Otherwise, you are imputing values assuming they have a correlation of zero with the variables you did not include in your imputation model. This would result in underestimating the association between parameters of interest in your analysis and a loss of power to detect properties of your data that may be of interest, such as non-linearities and statistical interactions. For additional reading on this particular topic see:

1. von Hippel, 2009
2. von Hippel, 2013
3. White et al., 2010

Preparing to conduct MI:

First step: Examine the number and proportion of missing values among your variables of interest. The proc means procedure in SAS has an option called nmiss that will count the number of missing values for the variables specified. You can also create missing data flags or indicator variables for the missing information to assess the proportion of missingness.

Second step: Examine the missing data patterns among your variables of interest. The "Missing Data Patterns" table can be requested without actually performing a full imputation by specifying the option nimpute=0 (specifying zero imputed datasets to be created) on the proc mi statement line.
Each "group" represents a set of observations in the data set that share the same pattern of missing information. For example, group 1 represents the 130 observations in the data that have complete information on all 5 variables of interest. This procedure also provides means for each variable for this group. You can see that there are a total of 12 patterns for the specified variables. The estimated means associated with each missing data pattern can also give you an indication of whether the assumption of MCAR or MAR is appropriate. If you begin to observe that those with certain missing data patterns appear to have a very different distribution of values, this is an indication that your data may not be MCAR.

Moreover, depending on the nature of the data, you may recognize patterns such as monotone missing, which can be observed in longitudinal data when an individual drops out at a particular time point and therefore all data after that point are missing. Additionally, you may identify skip patterns that were missed in your original review of the data and that should be dealt with before moving forward with the multiple imputation.

Third step: If necessary, identify potential auxiliary variables. Auxiliary variables are variables in your data set that are either correlated with a missing variable(s) (the recommendation is r > 0.4) or are believed to be associated with missingness. These are factors that are not of particular interest in your analytic model, but they are added to the imputation model to increase power and/or to help make the assumption of MAR more plausible. These variables have been found to improve the quality of imputed values generated from multiple imputation. Moreover, research has demonstrated their particular importance when imputing a dependent variable and/or when you have variables with a high proportion of missing information (Johnson and Young, 2011; Young and Johnson, 2010; Enders, 2010).
You may know a priori of several variables you believe would make good auxiliary variables based on your knowledge of the data and subject matter. Additionally, a good review of the literature can often help identify them as well. However, if you're not sure which variables in the data would be potential candidates (this is often the case when conducting secondary data analysis), you can use some simple methods to help identify them. One way to identify these variables is by examining associations between write, read, female, and math and other variables in the dataset. For example, let's take a look at the correlation matrix between our 4 variables of interest and two other test score variables, science and socst.

Science and socst both appear to be good auxiliary variables because they are well correlated (r > 0.4) with all the other test score variables of interest. You will also notice that they are not well correlated with female. A good auxiliary variable does not have to be correlated with every variable to be used. You will also notice that science has missing information of its own. A good auxiliary variable is not required to have complete information to be valuable; it can have missing values and still be effective in reducing bias (Enders, 2010).

One area that is still under active research is whether it is beneficial to include a variable as an auxiliary if it does not pass the 0.4 correlation threshold with any of the variables to be imputed. Some researchers believe that including these types of items introduces unnecessary error into the imputation model (Allison, 2012), while others do not believe that there is any harm in this practice (Enders, 2010). Thus, we leave it up to you as the researcher to use your best judgment.

Good auxiliary variables can also be correlates or predictors of missingness. Let's use the missing data flags we made earlier to help us identify some variables that may be good correlates.
We examine if our potential auxiliary variable socst also appears to predict missingness. Below is a set of t-tests to test if the mean socst or science scores differ significantly between those with missing information and those without.

The only significant difference was found when examining missingness on math with socst. Above you can see that the mean socst score is significantly lower among the respondents who are missing on math. This suggests that socst is a potential correlate of missingness (Enders, 2010) and may help us satisfy the MAR assumption for multiple imputation if we include it in our imputation model.

Example 1: MI using multivariate normal distribution (MVN):

When choosing to impute one or many variables, one of the first decisions you will make is the type of distribution under which you want to impute your variable(s). One method available in SAS uses Markov Chain Monte Carlo (MCMC), which assumes that all the variables in the imputation model have a joint multivariate normal distribution. This is probably the most common parametric approach for multiple imputation. The specific algorithm used is called the data augmentation (DA) algorithm, which belongs to the family of MCMC procedures. The algorithm fills in missing data by drawing from a conditional distribution, in this case a multivariate normal, of the missing data given the observed data. In most cases, simulation studies have shown that assuming a MVN distribution leads to reliable estimates even when the normality assumption is violated, given a sufficient sample size (Demirtas et al., 2008; KJ Lee, 2010). However, biased estimates have been observed when the sample size is relatively small and the fraction of missing information is high.

Note: Since we are using a multivariate normal distribution for imputation, decimal and negative values are possible.
These values are not a problem for estimation; however, we will need to create dummy variables for the nominal categorical variables so the parameter estimates for each level can be interpreted.

Imputation in SAS requires 3 procedures. The first is proc mi, where the user specifies the imputation model to be used and the number of imputed datasets to be created. The second procedure runs the analytic model of interest (here it is a linear regression using proc glm) within each of the imputed datasets. The third step runs a procedure called proc mianalyze, which combines all the estimates (coefficients and standard errors) across all the imputed datasets and outputs one set of parameter estimates for the model of interest.

On the proc mi procedure line we can use the nimpute option to specify the number of imputations to be performed. The imputed datasets will be output using the out option, and stored appended or "stacked" together in a dataset called "mimvn". An indicator variable called _Imputation_ is automatically created by the procedure to number each new imputed dataset.

After the var statement, all the variables for the imputation model are specified, including all the variables in the analytic model as well as any auxiliary variables. The option seed is not required, but since MI is designed to be a random process, setting a seed will allow you to obtain the same imputed dataset each time.

This estimates the linear regression model for each imputed dataset individually using the by statement and the indicator variable created previously. You will observe in the Results Viewer that SAS outputs the parameter estimates for each of the 10 imputations. The output statement stores the parameter estimates from the regression model in the dataset named "amvn." This dataset will be used in the next step of the process, the pooling phase.
Proc mianalyze uses the dataset "amvn" that contains the parameter estimates and associated covariance matrices for each imputation. The variance/covariance matrix is needed to estimate the standard errors. This step combines the parameter estimates into a single set of statistics that appropriately reflect the uncertainty associated with the imputed values. The coefficients are simply the arithmetic mean of the individual coefficients estimated for each of the 10 regression models. Averaging the parameter estimates dampens the variation, thus increasing efficiency and decreasing sampling variation. Estimation of the standard error for each variable is a little more complicated and will be discussed in the next section.

If you compare these estimates to those from the complete data you will observe that they are, in general, quite comparable. The variables write, female and math are significant in both sets of data. You will also observe a small inflation in the standard errors, which is to be expected since the multiple imputation process is designed to build additional uncertainty into our estimates.

2. Imputation Diagnostics:

Above the "Parameter Estimates" table in the SAS output you will see a table called "Variance Information". It is important to examine the output from proc mianalyze, as several pieces of the information can be used to assess how well the imputation performed. Below we discuss each piece:

Variance Between (V_B): This is a measure of the variability in the parameter estimates (coefficients) obtained from the 10 imputed datasets. For example, if you took all 10 of the parameter estimates for write and calculated the variance, this would equal V_B = 0.000262. This variability estimates the additional variation (uncertainty) that results from missing data.

Variance Within (V_W): This is simply the arithmetic mean of the sampling variances (the squared standard errors) from each of the 10 imputed datasets.
For example, if you squared the standard errors for write in all 10 imputations, summed them, and then divided by 10, the result would equal $V_W = 0.006014$. This estimates the sampling variability that we would have expected had there been no missing data.
Variance Total ($V_T$): The primary usefulness of MI comes from how the total variance is estimated. While regression coefficients are just averaged across imputations, Rubin's formula (Rubin, 1987) partitions variance into "within imputation", capturing the expected sampling uncertainty, and "between imputation", capturing the estimation variability due to missing information (Graham et al., 2007; White et al., 2011). The total variance is the sum of three sources of variance: the within, the between, and an additional source of sampling variance. For example, the total variance for the variable write would be calculated like this:
$V_T = V_B + V_W + V_B/m = 0.000262 + 0.006014 + 0.000262/10 = 0.006302$
The additional sampling variance is literally the variance between divided by m. This value represents the sampling error associated with the overall or average coefficient estimates. It is used as a correction factor for using a specific number of imputations, and it becomes smaller the more imputations are conducted. The idea is that the larger the number of imputations, the more precise the parameter estimates will be.
Bottom line: The main difference between multiple imputation and single imputation methods is in the estimation of the variances. The SE for each parameter estimate is the square root of its $V_T$.
Degrees of Freedom (DF): Unlike analysis with non-imputed data, sample size does not directly influence the estimate of DF. DF actually continues to increase as the number of imputations increases.
The standard formula used to calculate DF can result in fractional estimates as well as estimates that far exceed the DF that would have resulted had the data been complete. By default, SAS takes the complete-data DF to be infinite. Note: Starting in SAS v.8, a formula to adjust for the problem of inflated DF has been implemented (Barnard and Rubin, 1999). Use the EDF= option on the proc mianalyze line to give SAS the complete-data DF on which the adjustment is based.
Bottom line: The standard formula assumes that the estimator has a normal distribution, i.e. a t-distribution with infinite degrees of freedom. In large samples this is not usually an issue, but it can be with smaller sample sizes. In that case, the corrected formula should be used (Lipsitz et al., 2002).
Relative Increase in Variance (RIV/RVI): The proportional increase in total sampling variance that is due to missing information, $RIV = (V_B + V_B/m) / V_W$. For example, the RVI for write is 0.048; this means that the estimated sampling variance for write is 4.8% larger than its sampling variance would have been had the data on write been complete.
Bottom line: Variables with large amounts of missing data and/or that are weakly correlated with other variables in the imputation model will tend to have high RVIs.
Fraction of Missing Information (FMI): This is directly related to RVI. It is the proportion of the total sampling variance that is due to missing data, $FMI = (V_B + V_B/m) / V_T$. It is estimated based on the percentage missing for a particular variable and how correlated this variable is with other variables in the imputation model. The interpretation is similar to an R-squared, so an FMI of 0.046 for write means that 4.6% of the total sampling variance is attributable to missing data. The accuracy of the FMI estimate increases as the number of imputations increases, because variance estimates become more stable. This is especially important in the presence of variables with a high proportion of missing information.
If convergence of your imputation model is slow, examine the FMI estimates for each variable in your imputation model. A high FMI can indicate a problematic variable.
Bottom line: If FMI is high for any particular variable(s), consider increasing the number of imputations. A good rule of thumb is to have the number of imputations (at least) equal the highest FMI percentage.
Relative Efficiency: The relative efficiency (RE) of an imputation (how well the true population parameters are estimated) is related to both the amount of missing information and the number (m) of imputations performed. When the amount of missing information is very low, efficiency may be achieved by performing only a few imputations (the minimum number given in most of the literature is 5). When there is a high amount of missing information, however, more imputations are typically necessary to achieve adequate efficiency for parameter estimates. You can obtain relatively good efficiency even with a small m, but this does not mean that the standard errors will be well estimated. More imputations are often necessary for proper standard error estimation, because it is the variability between imputed datasets that incorporates the necessary amount of uncertainty around the imputed values.
The direct relationship between RE, m and the FMI is $RE = 1 / (1 + FMI/m)$. This formula gives the RE of using m imputations versus an infinite number of imputations. To get an idea of what this looks like in practice, see the figure in the SAS documentation, where m is the number of imputations and lambda is the FMI.
Bottom line: It may appear that you can get good RE with a few imputations; however, it often takes more imputations to get good estimates of the variances than to get good estimates of parameters like means or regression coefficients.
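The variance quantities in this section can be checked by hand. The following Python sketch recomputes them for write from the reported $V_W = 0.006014$ and $V_B = 0.000262$ with m = 10, using the simplified formulas given in the text. The function names are hypothetical, and the values SAS prints for FMI and DF may include further degrees-of-freedom adjustments, so they can differ slightly. The rubin_df function uses the unadjusted formula from Rubin (1987), $(m - 1)(1 + 1/RIV)^2$, which is an addition here rather than something quoted in the seminar.

```python
import math

def pool_variance(v_w, v_b, m):
    """Combine within- and between-imputation variance for one coefficient."""
    extra = v_b / m                      # correction for using a finite m
    v_t = v_w + v_b + extra              # total variance
    return {
        "v_t": v_t,
        "se": math.sqrt(v_t),            # pooled standard error
        "riv": (v_b + extra) / v_w,      # relative increase in variance
        "fmi": (v_b + extra) / v_t,      # fraction of missing information (simplified)
    }

def rubin_df(riv, m):
    """Unadjusted degrees of freedom from Rubin (1987); undefined when riv is 0."""
    return (m - 1) * (1 + 1 / riv) ** 2

def relative_efficiency(fmi, m):
    """Efficiency of m imputations relative to infinitely many."""
    return 1 / (1 + fmi / m)

def rule_of_thumb_m(max_fmi):
    """Number of imputations at least equal to the highest FMI percentage."""
    return math.ceil(100 * max_fmi)

if __name__ == "__main__":
    write = pool_variance(v_w=0.006014, v_b=0.000262, m=10)
    print(f"V_T = {write['v_t']:.6f}")   # 0.006302
    print(f"RIV = {write['riv']:.3f}")   # 0.048
    print(f"FMI = {write['fmi']:.3f}")   # 0.046
    print(f"DF  = {rubin_df(write['riv'], 10):.0f}")
    for m in (5, 10, 20, 50):
        print(m, round(relative_efficiency(write["fmi"], m), 4))
```

With an FMI this small, RE is already very close to 1 at m = 5, which is the point made above: efficiency of the coefficient estimates is a weaker criterion for choosing m than stability of the variance estimates.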
After performing an imputation it is also useful to look at means, frequencies and box plots comparing observed and imputed values to assess whether the range appears reasonable. You may also want to examine plots of residuals and outliers for each imputed dataset individually. If anomalies are evident in only a small number of imputations, this indicates a problem with the imputation model (White et al., 2011).
You should also assess convergence of your imputation model. This should be done for different imputed variables, but specifically for those variables with a high proportion of missing information (e.g. high FMI). Convergence of the proc mi procedure means that the DA algorithm has reached an appropriate stationary posterior distribution. Convergence for each imputed variable can be assessed using trace plots, which can be requested on the mcmc statement in proc mi. Long-term trends in trace plots and high serial dependence are indicative of slow convergence to stationarity. A stationary process has a mean and variance that do not change over time. By default SAS provides trace plots of the estimated means for each variable, but you can also ask for them for the standard deviations. Examples of good and bad trace plots can be found in the SAS User's Guide section on "Assessing Markov Chain Convergence".
The example trace plot here is for the mean social studies score. There are two main things you want to note in a trace plot. First, assess whether the algorithm appears to have reached a stable posterior distribution by checking that the mean remains relatively constant and that there is no sign of any trend (indicating a sufficient amount of randomness in the means between iterations). In our case, this looks to be true. Second, examine the plot to see how long it takes to reach this stationary phase.
In this example it appears to happen almost immediately, indicating good convergence. The dotted lines mark the iterations at which an imputed dataset is drawn. By default the burn-in period (the number of iterations before the first set of imputed values is drawn) is 200. If proper convergence does not appear to have been achieved, this can be increased using the nbiter= option on the mcmc statement.
Another plot that is very useful for assessing convergence is the autocorrelation plot, also requested on the mcmc statement using plots=acf. This helps us to assess possible autocorrelation of parameter values between iterations. Let's say you noticed a trend in the mean social studies scores in the trace plot. You may want to assess the magnitude of the observed dependency of scores across iterations, and the autocorrelation plot will show you that. In the autocorrelation plot for this example, the correlation is perfect when the mcmc algorithm starts but quickly drops to near zero after a few iterations, indicating almost no correlation between iterations and therefore no correlation between values in adjacent imputed datasets. By default SAS draws an imputed dataset every 100 iterations; if correlation appears high for longer than that, you will need to increase the number of iterations between imputed datasets using the niter= option. See the SAS 9.4 proc mi documentation for more information about this and other options.
Note: The amount of time it takes to get to zero (or near zero) correlation is an indication of convergence time (Enders, 2010). For more information on these and other diagnostic tools, please see Enders, 2010 and Rubin, 1987.
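The quantity shown in an autocorrelation plot can also be computed directly. The sketch below is a hedged Python illustration, not SAS output: it assumes you have exported the per-iteration values of a parameter (for example, the mean of one variable) into a plain list, and it reports the first lag at which the sample autocorrelation falls below a threshold. The function names, the 0.1 threshold and the simulated chains are all assumptions made for the example.

```python
import random

def autocorrelation(chain, lag):
    """Sample autocorrelation of a sequence at the given lag (lag 0 gives 1).

    Raises ZeroDivisionError if the chain is constant.
    """
    n = len(chain)
    mean = sum(chain) / n
    denom = sum((v - mean) ** 2 for v in chain)
    num = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag))
    return num / denom

def first_lag_below(chain, threshold=0.1, max_lag=200):
    """Smallest lag whose |autocorrelation| is under threshold, or None."""
    for lag in range(1, min(max_lag, len(chain) - 1) + 1):
        if abs(autocorrelation(chain, lag)) < threshold:
            return lag
    return None

if __name__ == "__main__":
    gen = random.Random(1)
    # A well-mixing chain: independent draws around a fixed mean.
    good = [52 + gen.gauss(0, 1) for _ in range(5000)]
    # A slowly mixing chain: each value depends strongly on the previous one.
    slow, value = [], 0.0
    for _ in range(5000):
        value = 0.9 * value + gen.gauss(0, 1)
        slow.append(52 + value)
    print("well-mixing chain:", first_lag_below(good))
    print("slow chain:", first_lag_below(slow))
```

If the lag returned for a real chain were larger than the number of iterations between imputations, that would be the situation described above in which the spacing between imputed datasets should be increased.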
Example 2: MI using fully conditional specification (also known as imputation by chained equations/ICE or sequential generalized regression)
A second method available in SAS imputes missing variables using the fully conditional specification (FCS) method, which does not assume a joint distribution but instead uses a separate conditional distribution for each imputed variable. This specification may be necessary if you are imputing a variable that must take on only specific values, such as a binary outcome for a logistic model or a count variable for a Poisson model. In simulation studies (Lee & Carlin, 2010; van Buuren, 2007), FCS has been shown to produce estimates that are comparable to the MVN method. Later we will discuss some diagnostic tools that can be used to assess whether convergence was reached when using FCS.
The FCS methods available in SAS are discriminant function and logistic regression for binary/categorical variables, and linear regression and predictive mean matching for continuous variables. If you do not specify a method, the discriminant function and regression methods are used by default. Some interesting properties of each of these options are:
1. The discriminant function method allows the user to specify prior probabilities of group membership. By default, only continuous variables can be covariates in the discriminant function. To change this default, use the classeffects= option.
2. The logistic regression method assumes an ordering of class variables with more than two levels.
3. The default imputation method for continuous variables is regression. The regression method allows the use of ranges and rounding for imputed values. These options are problematic and typically introduce bias (Horton et al., 2003; Allison, 2005). See the "Other Issues" section for further discussion of this topic.
4. The predictive mean matching method will provide imputed values that are consistent with observed values. If plausible values are necessary, this is a better choice than using bounds or rounding values produced from regression.
For more information on these methods and the options associated with them, see SAS Help and Documentation on the FCS statement.
The basic set-up for conducting an imputation is as follows. The var statement includes all the variables that will be used in the imputation model. If you want to impute variables using a method different from the default, you can specify which variable(s) to impute, and by what method, on the FCS statement. In this example we impute the binary variable female and the categorical variable prog using the discriminant function method. Since both are categorical, we also list female and prog on the class statement. Note: Because we are using the discriminant function method to impute prog, we no longer need to create dummy variables. Additionally, we use the classeffects=include option so that all continuous and categorical variables will be used as predictors when imputing female and prog. All the other variables on the var statement will be imputed using regression, since a different distribution was not specified.
The ordering of variables on the var statement controls the order in which variables are imputed. With multiple imputation using FCS, a single imputation is conducted during an initial fill-in stage. After the initial stage, the variables with missing values are imputed in the order specified on the var statement, with each subsequent variable being imputed using observed and imputed values from the variables that preceded it. For more information on this, see White et al., 2011. As in the previous proc mi example using MVN, we can also specify the number of burn-in iterations using the nbiter= option.
The FCS statement also allows users to specify which variables to use as predictors. If no covariates are given for an imputed variable, SAS assumes that all the variables on the var statement are to be used to predict all other variables. Multiple conditional distributions can be specified in the same FCS statement. Three example specifications illustrate this. The first imputes female and prog under a generalized logit distribution, which is appropriate for non-ordered categorical variables, instead of the default cumulative logit, which is appropriate for ordered variables. The second imputes female and prog under a generalized logit distribution and uses predictive mean matching to impute math, read and write instead of the default regression method. The third indicates that prog and female should be imputed using different sets of predictors.
2. Analysis and Pooling Phase
Once the 20 multiply imputed datasets have been created, we can run our linear regression using proc genmod. Since we imputed female and prog under a distribution appropriate for categorical outcomes, the imputed values will now be true integer values. Comparing the proc freq results for female and prog in the second imputed dataset with the original data (which has missing values) shows that the FCS method has imputed "real" values for our categorical variables. Prog and female can now be used in the class statement, and we no longer need to create dummy variables for prog.
As with the previous example using MVN, we run our model on each imputed dataset stored in mifcs. We also use an ODS Output statement to save the parameter estimates from our 20 regressions. A proc print of gmfcs shows what the parameter estimates look like for the first two imputed datasets. "_Imputation_" indicates which imputed dataset each set of parameter estimates belongs to.
"Level1" indicates the levels or categories for our class variables. The mianalyze procedure now requires some additional specification in order to combine the parameter estimates properly. The parameter estimates for variables used in our model's class statement have one row for each level, and a column called "Level1" specifies the name or label associated with each category. In order for mianalyze to combine the estimates appropriately for the class variables, we need to add some options to the proc mianalyze line. As before, parms= refers to the input SAS dataset that contains the parameter estimates computed from each imputed dataset. However, we also need to add the classvar option. This option is only appropriate when the model effects contain classification variables. Since proc genmod names the column indicator for classification "Level1", we need to specify classvar=level. Note: Different procedures in SAS require different classvar options.
If you compare these estimates to those from the full data, you will see that the magnitudes of the write, female and math parameter estimates using the FCS data are very similar to the results from the full data. Additionally, the overall significance or non-significance of specific variables remains unchanged. As with the MVN model, the SEs are larger due to the incorporation of uncertainty around the parameter estimates, but these SEs are still smaller than those we observed in the complete case analysis.
4. Imputation Diagnostics
As with the previous imputation method using MVN, the FCS statement will output trace plots. These can be examined for the mean and standard deviation of each continuous variable in the imputation model. As before, the dashed vertical line indicates the final iteration, where the imputation occurred. Each line represents a different imputation, so all 20 imputation chains are overlaid on top of one another.
Autocorrelation plots are only available with the mcmc statement, when assuming a joint multivariate normal distribution. This plot is not available when using the FCS statement.
1. Why Auxiliary Variables?
One question you may be asking yourself is why auxiliary variables are necessary or even important. First, they can help improve the likelihood of meeting the MAR assumption (White et al., 2011; Johnson and Young, 2011; Allison, 2012). Remember, a variable is said to be missing at random if other variables in the dataset can be used to predict missingness on a given variable. So you want your imputation model to include all the variables you think are associated with or predict missingness in your variable in order to fulfill the MAR assumption. Second, including auxiliaries has been shown to help yield more accurate and stable estimates and thus reduce the estimated standard errors in analytic models (Enders, 2010; Allison, 2012; von Hippel and Lynch, 2013). This is especially true in the case of missing outcome variables. Third, including these variables can also help to increase power (Reis and Judd, 2000; Enders, 2010). In general, there is almost always a benefit to adopting a more "inclusive analysis strategy" (Enders, 2010; Allison, 2012).
2. Selecting the number of imputations (m)
Historically, the recommendation was for three to five MI datasets. Relatively low values of m may still be appropriate when the fraction of missing information is low and the analysis techniques are relatively simple. Recently, however, larger values of m are often recommended. To some extent, this change is based on the radical increase in the computing power available to the typical researcher, which makes it more practical to create and analyze multiply imputed datasets with a larger number of imputations. Recommendations for m vary.
For example, five to 20 imputations are recommended for low fractions of missing information, and as many as 50 (or more) when the proportion of missing data is relatively high. Remember that estimates of coefficients stabilize at much lower values of m than estimates of variances and covariances of error terms (i.e. standard errors). Thus, in order to get appropriate estimates of these parameters, you may need to increase m. A larger number of imputations may also allow hypothesis tests with less restrictive assumptions (i.e. tests that do not assume equal fractions of missing information for all coefficients). Multiple runs of m imputations are recommended to assess the stability of the parameter estimates.
Graham et al. (2007) conducted a simulation demonstrating the effect on power, efficiency and parameter estimates across different fractions of missing information as m decreases. The authors found that:
1. Mean square error and standard error increased.
2. Power was reduced, especially when FMI is greater than 50% and the effect size is small, even for a large m (20 or more).
3. Variability of the estimate of FMI increased substantially.
In general, the estimation of FMI improves with an increased m.
Another factor to consider is the importance of reproducibility between analyses using the same data. White et al. (2011), assuming the true FMI for any variable would be less than or equal to the percentage of cases that are incomplete, use the rule that m should equal the percentage of incomplete cases. Thus, if the FMI for a variable is 20%, you need 20 imputed datasets. A similar analysis by Bodner (2008) makes a similar recommendation. White et al. (2011) also found that, under this assumption, the error associated with estimating the regression coefficients, standard errors and the resulting p-values was considerably reduced and resulted in an adequate level of reproducibility.
3. Maximum, Minimum and Round
This issue often comes up in the context of using MVN to impute variables that normally have integer values or bounds. Intuitively, it makes sense to round values or incorporate bounds to give "plausible" values. However, these methods have been shown to decrease efficiency and increase bias by altering the correlations or covariances between variables estimated during the imputation process. Additionally, these changes will often result in an underestimation of the uncertainty around imputed values. Remember that imputed values are NOT equivalent to observed values and serve only to help estimate the covariances between variables needed for inference (Johnson and Young, 2011). Leaving the imputed values as they are in the imputation model is perfectly fine for your analytic models. If plausible values are needed to perform a specific type of analysis, then you may want to use a different imputation algorithm such as FCS.
Isn't multiple imputation just making up data?
No. This argument can be made about missing data methods that use a single imputed value, because that value will be treated like observed data, but it is not true of multiple imputation. Unlike single imputation, multiple imputation builds the uncertainty/error associated with the missing data into the model. Therefore the process and subsequent estimation never depend on a single value. Additionally, another method for dealing with missing data, maximum likelihood, produces almost identical results to multiple imputation and does not require the missing information to be filled in.
What is passive imputation?
Passive variables are functions of imputed variables. For example, let's say we have a variable X with missing information, but in our analytic model we will need to use $X^2$. In passive imputation we would impute X and then use those imputed values to create the quadratic term. This method is called "impute then transform" (von Hippel, 2009).
While this appears to make sense, additional research (Seaman et al., 2012; Bartlett et al., 2014) has shown that using this method is actually a misspecification of your imputation model and will lead to biased parameter estimates in your analytic model. There are better ways of dealing with transformations.
How do I treat variable transformations such as logs, quadratics and interactions?
Most of the current literature on multiple imputation supports treating variable transformations as "just another variable". For example, suppose you know that in your subsequent analytic model you are interested in the modifying effect of Z on the association between X and Y (i.e. an interaction between X and Z). This is a property of your data that you want to be maintained in the imputation. Using something like passive imputation, where the interaction is created after you impute X and/or Z, means that the filled-in values are imputed under a model assuming that Z is not a moderator of the association between X and Y. Thus, your imputation model is misspecified.
Should I include my dependent variable (DV) in my imputation model?
Yes! An emphatic YES, unless you would like to impute independent variables (IVs) assuming they are uncorrelated with your DV (Enders, 2010), which causes the estimated association between your DV and IVs to be biased toward the null (i.e. underestimated). Additionally, using imputed values of your DV is considered perfectly acceptable when you have good auxiliary variables in your imputation model (Enders, 2010; Johnson and Young, 2011; White et al., 2011). However, if good auxiliary variables are not available, you should still INCLUDE your DV in the imputation model and then later restrict your analysis to only those observations with an observed DV value. Research has shown that imputing DVs when auxiliary variables are not present can add unnecessary random variation into your imputed values (Allison, 2012).
How much missing data can I have and still get good estimates using MI?
Simulations have indicated that MI can perform well, under certain circumstances, even with up to 50% missing observations (Allison, 2002). However, the larger the amount of missing information, the higher the chance you will run into estimation problems during the imputation process and the lower the chance of meeting the MAR assumption, unless the missingness was planned (Johnson and Young, 2011). Additionally, as discussed earlier, the higher the FMI, the more imputations are needed to reach good relative efficiency for effect estimates, especially standard errors.
What should I report in my methods about my imputation?
Most papers mention whether they performed multiple imputation but give very few, if any, details of how they implemented the method. In general, a basic description should include:
- Which statistical program was used to conduct the imputation.
- The type of imputation algorithm used (i.e. MVN or FCS).
- Some justification for choosing a particular imputation method.
- The number of imputed datasets (m) created.
- The proportion of missing observations for each imputed variable.
- The variables used in the imputation model and why, so your audience will know if you used a more inclusive strategy. This is particularly important when using auxiliary variables.
This may seem like a lot, but it would probably not require more than four or five sentences. Enders (2010) provides some examples of write-ups for particular scenarios. Additionally, Mackinnon (2010) discusses the reporting of MI procedures in medical journals.
Main Takeaways from this seminar:
- Multiple imputation is always superior to any of the single imputation methods because a single imputed value is never used and the variance estimates reflect the appropriate amount of uncertainty surrounding parameter estimates.
- There are several decisions to be made before performing a multiple imputation, including the distribution, the auxiliary variables and the number of imputations, that can affect the quality of the imputation.
- Remember that multiple imputation is not magic, and while it can help increase power it should not be expected to provide "significant" effects when other techniques like listwise deletion fail to find significant associations.
- Multiple imputation is one tool for researchers to address the very common problem of missing data.
References
Allison (2002). Missing Data. Sage Publications.
Allison (2012). Handling Missing Data by Maximum Likelihood. SAS Global Forum: Statistics and Data Analysis.
Allison (2005). Imputation of Categorical Variables with PROC MI. SUGI 30 Proceedings, Philadelphia, Pennsylvania, April 10-13, 2005.
Barnard and Rubin (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4): 948-955.
Bartlett et al. (2014). Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model. Stat Methods Med Res.
Bodner (2008). What Improves with Increased Missing Data Imputations? Structural Equation Modeling: A Multidisciplinary Journal, 15(4): 651-675.
Demirtas et al. (2008). Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: a simulation assessment. Jour of Stat Computation & Simulation, 78(1).
Enders (2010). Applied Missing Data Analysis. The Guilford Press.
Graham et al. (2007). How Many Imputations are Really Needed? Some Practical Clarifications of Multiple Imputation Theory. Prev Sci, 8: 206-213.
Horton et al. (2003). A potential for bias when rounding in multiple imputation. American Statistician, 57: 229-232.
Lee and Carlin (2010). Multiple Imputation for Missing Data: Fully Conditional Specification versus Multivariate Normal Imputation. Am J Epidemiol, 171(5): 624-32.
Lipsitz et al. (2002). A Degrees-of-Freedom Approximation in Multiple Imputation. J Statist Comput Simul, 72(4): 309-318.
Little and Rubin (2002). Statistical Analysis with Missing Data, 2nd edition. New York: John Wiley.
Johnson and Young (2011). Towards Best Practices in Analyzing Datasets with Missing Data: Comparisons and Recommendations. Journal of Marriage and Family, 73(5): 926-45.
Mackinnon (2010). The use and reporting of multiple imputation in medical research - a review. J Intern Med, 268: 586-593.
Reis and Judd, editors (2000). Handbook of Research Methods in Social and Personality Psychology.
Rubin (1976). Inference and Missing Data. Biometrika, 63(3): 581-592.
Rubin (1987). Multiple Imputation for Nonresponse in Surveys. J. Wiley & Sons, New York.
Seaman et al. (2012). Multiple imputation of missing covariates with non-linear effects: an evaluation of statistical methods. BMC Medical Research Methodology, 12(46).
Schafer and Graham (2002). Missing data: our view of the state of the art. Psychol Methods, 7(2): 147-77.
van Buuren (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16: 219-242.
von Hippel (2009). How to impute interactions, squares and other transformed variables. Sociol Methodol, 39: 265-291.
von Hippel and Lynch (2013). Efficiency Gains from Using Auxiliary Variables in Imputation. Cornell University Library.
von Hippel (2013). Should a Normal Imputation Model be Modified to Impute Skewed Variables? Sociological Methods & Research, 42(1): 105-138.
White et al. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4): 377-399.
Young and Johnson (2011). Imputing the Missing Y's: Implications for Survey Producers and Survey Users. Proceedings of the AAPOR Conference Abstracts, pp. 6242-6248.
Imputation of categorical and continuous data - multivariate normal vs chained equations
Question: Generally speaking, would you say that standard methods of multiple imputation (e.g. those available in PROC MI) have difficulty handling models with mixed (continuous and categorical) data? Or would you think (generally) that the multivariate normality assumption is robust in the context of MI for handling continuous and categorical missing data?
Answer: Opinion on this is somewhat mixed. A fair bit of work has been done on how to impute categorical data using the MVN model, and some papers have shown you can do quite well, provided you use so-called adaptive rounding methods for rounding the continuous imputed data. For more on this, see: CA Bernaards, TR Belin, JL Schafer. Robustness of a multivariate normal approximation for imputation of incomplete binary data. Statistics in Medicine 2007; 26:1368-1382.
Lee and Carlin found that both chained equations and imputation via an MVN model worked well, even with some binary and ordinal variables: KJ Lee and JB Carlin. Multiple Imputation for Missing Data: Fully Conditional Specification Versus Multivariate Normal Imputation. American Journal of Epidemiology 2010; 171:624-632.
In contrast, a paper by van Buuren concluded that the chained equations approach (also known as fully conditional specification, FCS) is preferable in situations with a mixture of continuous and categorical data: S van Buuren. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research 2007; 16:219-242.
My personal opinion is that the chained equations approach is preferable with a mixture of continuous and categorical data.
