
RESOURCES

Quick note: I will continue to update this page as more resources become available. 

CODE & DATA

This GitHub repository contains all current code and derived analytical data files associated with the project. The main technology stack consists of Python with the machine learning library scikit-learn (Pedregosa et al., 2011); MongoDB, a document-based database; Amazon Web Services for cloud computing resources; and GitHub for version control. The repository also includes some R code, primarily for data visualization using ggplot2 and for modeling topic transitions via the msm package.

Patent and raw website data are not included in the GitHub repo due to size constraints.   Patent data may be found via the USPTO's PatentsView bulk download page.  Please contact me if you would like access to raw website data, or even better, start collecting it yourself!  

WORKSHOP

The workshop is intended to introduce students and researchers to the sample frame generation and web scraping method.  The workshop then segues into cleaning and exporting data for further analysis. There are six labs organized into two sessions: 

Session 1 (120 minutes)

  1. Introduction to website crawling and analysis for research in Science and Innovation Policy and Management

  2. Identifying firm assignees in three high-technology industries 

  3. Identifying URLs (lab)

  4. Identifying employment data (lab)

  5. Website scraping (lab)

Session 2 (120 minutes)

  1. Thinking about data quality issues (lab)

  2. Topic modeling (lab) 

  3. Exploratory narrative analysis (lab)

  4. Summary and discussion  
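
As a rough illustration of the cleaning and exporting step that the labs lead into, here is a minimal Python sketch using only the standard library. It is not the project's code (see the GitHub repository for that). The class name `VisibleTextParser`, the helper functions, the CSV column names, and the sample page are all hypothetical and chosen only for this example.

```python
import csv
import io
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes from a page, skipping script and style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside <script> and <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def clean_html(html):
    """Return the visible text of an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


def export_pages(pages, out_file):
    """Write (firm, url, cleaned text) rows as CSV to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["firm", "url", "text"])
    for firm, url, html in pages:
        writer.writerow([firm, url, clean_html(html)])


# Hypothetical sample page; not taken from the project data.
sample_pages = [
    ("Example Firm", "https://example.com/about",
     "<html><head><style>p {color: red}</style></head>"
     "<body><h1>About us</h1><p>We build   sensors.</p>"
     "<script>var x = 1;</script></body></html>"),
]

buffer = io.StringIO()
export_pages(sample_pages, buffer)
print(buffer.getvalue())
```

In practice you would pass a file opened with `open(path, "w", newline="")` instead of the in-memory buffer. The resulting CSV can then feed the topic modeling and exploratory analysis labs.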

As of April 2019, I have taught this workshop three times to approximately 30 individuals at the American Institutes for Research (December 2018), Georgia Institute of Technology’s School of Public Policy (January 2019), and the University of Manchester’s Alliance Manchester Business School (UK; April 2019, remotely).

Please note that the workshop materials do not contain the latest code and should be consulted for pedagogical purposes only. Depending on demand and interest, I may prepare and upload short videos to YouTube to support self-learning.

Start with the getting started guide, then consult the workshop slides. Next, check out the main branch on GitHub to begin using the code in your own research project. Additional technical details may be found in the README on GitHub.
