This work focuses on the data science challenge problem of predicting the decision for past immigration visa applications using supervised machine learning for classification. I describe an end-to-end approach that first prepares historical data for supervised inductive learning, trains various discriminative models, and evaluates these models using simple statistical validation methods.

The H-1B visa allows employers in the United States to temporarily employ foreign nationals in various specialty occupations that require a bachelor’s degree or higher in the specific specialty, or its equivalents. These specialty occupations may often include, but are not limited to: medicine, health, journalism, and areas of science, technology, engineering and mathematics (STEM). Every year the United States Citizenship and Immigration Service (USCIS) grants a current maximum of 85,000 visas, even though the number of applicants surpasses this amount by a huge difference and this selection process is claimed to be a lottery system. The dataset used for this experimental research project contains all the petitions made for this visa cap from the year 2011 to 2016.

This project aims at using discriminative machine learning techniques to classify these petitions and predict the “case status” of each petition based on various factors. Exploratory data analysis is also done to determine the top employers, the locations which most appeal for foreign nationals under this visa cap and the job roles which have the highest number of foreign workers. I apply supervised inductive learning algorithms such as Gaussian Naïve Bayes, Logistic Regression, and Random Forests to identify the most probable factors for H-1B visa certifications and compare the results of each to determine the best predictive model for this testbed.