In this paper, we demonstrate how to
enhance the validity of causal inference with unstructured
high-dimensional treatments like texts, by leveraging the power of
generative Artificial Intelligence. Specifically, we propose to use
a deep generative model, such as a large language model (LLM), to
efficiently generate treatments and use its internal
representation for subsequent causal effect estimation. We show
that the knowledge of this true internal representation helps
disentangle the treatment features of interest, such as specific
sentiments and certain topics, from other possibly unknown
confounding features. Unlike existing methods, our proposed
approach eliminates the need to learn a causal representation from the
data, and hence produces more accurate and efficient estimates. We
formally establish the conditions required for the nonparametric
identification of the average treatment effect, propose an
estimation strategy that avoids the violation of the overlap
assumption, and derive the asymptotic properties of the proposed
estimator through the application of double machine learning.
Finally, using an instrumental variables approach, we extend the
proposed methodology to settings in which the treatment feature
is based on human perception rather than assumed to be fixed
given the treatment object. The proposed methodology is also
applicable to text reuse where an LLM is used to regenerate existing
texts. We conduct simulation and empirical studies, using text data
generated by an open-source LLM, Llama 3, to illustrate
the advantages of our estimator over state-of-the-art causal
representation learning algorithms.