In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses
on policy diffusion detection across the state legislatures in the
United States over time. Previous work on policy diffusion has been
unable to make an all-pairs comparison between bills because of the
computational cost involved; scholars have instead focused on single
topic areas.
We provide an implementation of this analysis workflow as a
distributed text-processing pipeline built on Spark DataFrames and the
Scala application programming interface (API). We discuss the challenges
of, and our strategies for, processing unstructured data, choosing data
formats for storage and efficient access, and performing graph
processing at scale.
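
As a rough illustration of what such a pipeline can look like, the sketch below uses Spark DataFrames with the Scala API, hashed term-frequency features, and MinHash locality-sensitive hashing to approximate the all-pairs bill comparison. It is a minimal sketch under stated assumptions, not the paper's actual implementation: the input path, column names, feature dimensionality, and distance threshold are illustrative.

// Minimal sketch (assumptions: a Parquet file "bills.parquet" with columns
// bill_id and text; thresholds chosen for illustration only).
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{RegexTokenizer, HashingTF, MinHashLSH}

object BillSimilaritySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("policy-diffusion-sketch").getOrCreate()

    // Hypothetical input: one row per bill with columns (bill_id, text).
    val bills = spark.read.parquet("bills.parquet")

    // Tokenize the raw bill text and hash tokens into a fixed-size feature vector.
    val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
    val hashingTF = new HashingTF()
      .setInputCol("tokens").setOutputCol("features").setNumFeatures(1 << 18)
    val features = hashingTF.transform(tokenizer.transform(bills))

    // MinHash LSH keeps the all-pairs comparison tractable by only scoring
    // candidate pairs that collide in at least one hash bucket.
    val lsh = new MinHashLSH()
      .setInputCol("features").setOutputCol("hashes").setNumHashTables(5)
    val model = lsh.fit(features)

    val pairs = model
      .approxSimilarityJoin(features, features, 0.8, "jaccard_distance")
      .filter("datasetA.bill_id < datasetB.bill_id") // drop self-pairs and mirrored duplicates

    pairs.select("datasetA.bill_id", "datasetB.bill_id", "jaccard_distance").show(20)
    spark.stop()
  }
}

The design point the sketch is meant to convey is that expressing the workflow as DataFrame transformations lets Spark distribute both feature extraction and the (approximate) pairwise join, which is what makes an all-pairs comparison feasible where single-machine approaches were not.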