at Carnegie Mellon University

Pittsburgh is still defined by a geography of uneven development where modern disparities were built from historic patterns of discrimination. While searching through PghSNAP, I was struck by the similarities between the survey of prevailing building conditions and the Home Owners’ Loan Corporation (HOLC) map of 1937. So, I created a GIS-based framework to assess the legacies of neighborhood appraisal and lending discrimination in Pittsburgh by intersecting the HOLC map with census data.

Before getting swept up in the analysis, I want to be clear about what the HOLC map is and what it represents, so in the interest of brevity, I have highlighted some of the key aspects of the history. For a more thorough (perhaps too thorough) version you can read the full paper: here.

In the 1930s, the federal government fundamentally transformed the mortgage market, creating the 30-year mortgage packages that make home-ownership accessible. In an effort to address perceived weaknesses in the housing market, federal officials advocated more ‘scientific’ appraisal methods. Appraisal ideology was forged within a climate of prejudice generally pervasive throughout white society, explicit institutional discrimination at most levels of government throughout the United States, and a heavily skewed distribution of economic power; policy makers saw little value in poor, African-American, immigrant, or Jewish communities and even viewed them as a direct threat to the value of middle class, white communities.

The maps are terrific localized representations of appraisal practices in each of the 239 cities they depicted. The HOLC maps were not directly published and used by bankers and appraisers to make lending decisions but were, nonetheless, certainly influential in the development of biased appraisals: the federal government published the tools, rationales, and examples necessary for banks to create maps of their own.

The first portion of my investigation tried to understand more about the impact of these practices at the time they were being honed. I intersected the 1940 Census, obtained from NHGIS, with the HOLC map. One of the strongest relationships within the appraisal of Pittsburgh neighborhoods was racial segregation. In the graph below, I arranged the communities by their HOLC ranking and then plotted the percentage of white residents in the tract (at this time, Pittsburgh had virtually only two racial groups). Looking at the four plots together, it is clear that a huge portion of Pittsburgh, regardless of value, was exclusively white. The dashed blue line represents the overall percentage of white people in Pittsburgh at the time; if communities were not segregated by race, they would hover around this line. Now consider the valuation of the tracts: not only were African Americans concentrated into a handful of places, but they were relegated to the lowest-quality neighborhoods, which were considered to be the least valuable (partially because of their presence). As a methodological aside, because census tracts aggregate over an area, segregation at the block level could only be starker.

Next, I explored how these historic practices have continued to shape Pittsburgh. I used standardized census data to identify entrenched neighborhood characteristics from 1970 to 2000. Tracts that persistently had the largest proportions of African Americans were almost entirely aligned with red and yellow areas, as you can see in the map below. Three main clusters of African American residents appeared: the Hill District, Manchester, and Homewood. These are communities that were historically considered the least valuable and were undermined economically and are, at least in part, still dealing with the effects. Further, tracts that persistently had the highest concentrations of poverty were also heavily focused in red and yellow areas. Red and yellow tracts were also much more susceptible to population loss than green or blue areas, as you can see in the graph below. In many ways these spaces have held their position through time, and the access, or lack of access, to mortgage financing had long-reaching legacies. Groups that historically were victimized by appraisal ideology continue to occupy these spaces. These neighborhoods are likely less stable as well, considering the large presence of poverty and the heaviest population losses.

Black communities have suffered from disinvestment

Pittsburgh’s uneven decline in population, 1970-2000

On the other hand, those communities that were uplifted by their historic value largely retained their status into modern times. Tracts with persistently the highest average incomes, home-ownership rates, even for African Americans, and the highest average values were all largely aligned with the historic green or blue categorizations. These communities benefited from unfettered access to the mortgage market and became the most stable, affluent neighborhoods in Pittsburgh because of their relative health. As you can see in the map for highest average values, Squirrel Hill, in particular, maintained its position. The disproportionate access to mortgage funds even continues today: according to PCRG, 7 neighborhoods received 50% of all mortgage dollars in 2015—6 of the 7 were historically rated either green or blue.

Wealthy communities benefited from redlining

What is clear from the assessment of Pittsburgh’s geographic legacies of redlining is that the city is still largely defined by an ugly history of uneven development. As much as we may like to think that we have moved beyond pre-World War II or pre-Civil Rights Pittsburgh, we live in a city that is still, at least somewhat, constructed the same way. Policies that have attempted to create equality and opportunity for parts of the city that were left behind have failed to do so. Those parts of the city that were built on their exclusion have maintained their privileged and elevated status. Today, as we are having debates about neighborhood quality, accessibility, and inclusion, we must remember the specific history of uneven development. Are we comfortable with this geography? If not, what are we willing to do, lest it define us for another 60 years?

Devin Rutan graduated from the University of Pittsburgh with a Bachelor of Philosophy in Urban Studies and studied Applied Statistics and GIS. Devin cares about housing and neighborhood development and is currently working with the Northside Coalition for Fair Housing and the Pittsburgh Tenants Union. Originally from the DC area, Devin is an avid basketball fan: Let’s go Wizards! You can follow his work here or connect with Devin here: https://www.linkedin.com/in/devin-rutan-884517134/.

Our cross-functional team from Heinz College and Tepper is interested in looking at the affordable housing scene in Pittsburgh. This was part of a case competition we won, jointly organized by SUDS and the Data Analytics clubs at Heinz and Tepper. Eventually, the topic grew on us as we shared similar experiences while searching for a house to rent or buy. Currently, Pittsburgh has an affordable housing deficit of 17,000 units. Affordability is a reflection of the price of the house, its condition, and the livability of the surrounding area. Thus, solutions to Pittsburgh’s housing challenges need to focus on healthy community development. Keeping that in mind, we propose an “Assess and Address” framework that identifies investor behavior and creates a system of incentives and penalties to address negative influences.

Hazelwood – Our Pilot

Like many other neighborhoods in Pittsburgh, Hazelwood has been hit hard since the city’s steel mills shut down. Still, it has maintained its tight community spirit. New developments like technology company investments, startups and malls are reviving communities in neighborhoods like East Liberty and Lawrenceville, but at the risk of gentrification. Hazelwood is representative of what has been happening across Pittsburgh. Recent developments like Almono and Summerset at Frick Park have the potential to revive the neighborhood. At the same time, we must also ensure that these new developments bring positive change and not displacement for longtime residents.

Hazelwood’s community focus is still intact, but the recent developments could attract bad investors. As we see from the comparison of average property prices, Hazelwood does not have cases where properties in poor or unsound condition are sold at inflated values, unlike the rest of Allegheny County. While this is a good sign for Hazelwood, it still has some vacant properties and many houses in average or fair condition, which could potentially attract attention from bad investors in the future, if they aren’t already there.

Note: The graph highlights outliers indicating that some properties in poor or average condition are sold for significantly inflated values. The edges of each box in the plot indicate the interquartile range of sale values for each property condition. The flat line within each box is the median. The dotted lines are the outliers. For our assessment, we considered all valid sales from the property assessments data maintained by the WPRDC. Also, we used zip code 15207 to distinguish Hazelwood from the rest of Allegheny County. Although 15207 covers parts of Glen Hazel and Greenfield, it represents the challenges we aim to address.
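For readers who want to reproduce the box plot summary described in the note above, here is a minimal sketch of the underlying arithmetic. The sale prices are invented for illustration, the function name `box_stats` is hypothetical, and the 1.5 × IQR outlier rule is the common plotting convention, assumed here because the note does not say which rule the graph used.

```python
from statistics import quantiles


def box_stats(values):
    """Summarize sale prices the way a box plot does.

    Assumption: outliers are points more than 1.5 * IQR beyond the
    quartiles (the usual plotting default, not confirmed by the post).
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


# Invented sale prices for one property-condition group.
sales = [20_000, 25_000, 30_000, 35_000, 40_000, 45_000, 50_000, 400_000]
stats = box_stats(sales)
print(stats)
```

In this made-up group the single very high sale falls outside the upper fence, which is the kind of point the note refers to as an inflated value.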

Investor Behavior – Who Are These “Bad” Investors?

A neighborhood stands to benefit if houses in poor or unlivable conditions are purchased and redeveloped. However, some investors are only interested in buying and reselling houses to make a quick profit, without improving their condition or spending on maintenance. Unlike “rehabbers”, “flippers” and “milkers” buy and keep properties in distressed conditions, and hope to sell them off for a profit as quickly as possible. Such behavior does not attract healthy investors, and it negatively affects the community and the quality of life in the neighborhood.

We decided to look at data-driven ways to identify and predict bad behavior using the property assessments data maintained by the WPRDC. Although we do not have any information on house owners, our cognitive solution identifies and flags potentially bad investments and highlights insightful characteristics.

The data does not have class labels that identify bad investments. We ran a k-means clustering algorithm on ownership duration, property value appreciation, and sale price to set our class labels for good and bad investments needed for analytical modeling. We avoided using general definitions for flippers to determine class boundaries as that could ignore any hidden patterns and add bias. For example, any house resold within 12 months for less than $100,000 is potentially a transaction conducted by a flipper. However, using this information alone to set class boundaries ignores several transactions where the property was held for only a little more than a year. The results from clustering directed us to target parcels that were

Owned for less than three years, and

Either depreciated in value, appreciated by less than 15%, or sold for less than $90,000.

Any parcels that met the above conditions were flagged as properties owned by potentially bad investors, while the rest were labeled as unsuspicious.
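The flagging rule above can be written down as a small function. This is a sketch under stated assumptions, not the team's actual code: the `Sale` record and its field names are hypothetical, and the rule is read as "held under three years AND (weak or negative appreciation OR sale price under $90,000)", which is one interpretation of the wording above.

```python
from dataclasses import dataclass


@dataclass
class Sale:
    """Hypothetical record for one resale; field names are illustrative."""
    years_held: float
    prev_price: float
    sale_price: float


def is_flagged(sale, max_years=3.0, min_appreciation=0.15, price_floor=90_000):
    """Return True if the sale matches the flagging rule as read above."""
    if sale.prev_price <= 0:
        raise ValueError("previous sale price must be positive")
    appreciation = (sale.sale_price - sale.prev_price) / sale.prev_price
    short_hold = sale.years_held < max_years
    weak_return = appreciation < min_appreciation  # also covers depreciation
    cheap_sale = sale.sale_price < price_floor
    return short_hold and (weak_return or cheap_sale)


# Invented examples.
print(is_flagged(Sale(years_held=1.5, prev_price=60_000, sale_price=65_000)))
print(is_flagged(Sale(years_held=5.0, prev_price=60_000, sale_price=62_000)))
```

Keeping the thresholds as parameters makes it easy to see how sensitive the flagged share is to the three-year, 15%, and $90,000 cutoffs.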

Once we were able to set the labels for good and bad investments, we ran a few classification algorithms to predict investor behavior. We held out 20% of the data for testing and trained several classifiers on the remaining 80%, including Gradient Boosting, Random Forest, and Conditional Inference Trees (a modeling technique based on unbiased recursive partitioning). All returned an AUC between 0.70 and 0.72. Gradient Boosting and Random Forest with feature selection returned marginal improvements, but none were significant.
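To make the evaluation step concrete, the sketch below shows an 80/20 hold-out split and a direct AUC calculation in plain Python. It is illustrative only: the helper names are hypothetical, the scores are invented, and the team's own models and tooling are not shown here. AUC is computed from its definition as the chance that a randomly chosen flagged parcel receives a higher score than a randomly chosen unflagged one, with ties counted as half.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented labels (1 = flagged) and model scores for five test parcels.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(round(auc(labels, scores), 3))

train, test = train_test_split(list(range(10)))
print(len(train), len(test))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so values around 0.70 to 0.72 indicate a useful but far from decisive signal.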

Our primary challenge was to keep bias in the data from affecting our results. The classes are imbalanced: around 90% of the parcels are not red-flagged. This makes sense, as we cannot expect the property market to be completely overrun by flippers. Additionally, the assessments data has missing values for some fields, including those which determine our class labels. For example, some parcels do not have previous sale records, making it difficult to determine how long the house was held before it was sold. We cannot assume that this is missing data, as the property may not have changed ownership more than once in its lifetime. Further, the assessments data set gives us the parcel characteristics as of today, not as of the date the property was last sold. For example, the condition of a property may have either improved or deteriorated after it was last sold four years ago, but we have no way to find out. For these reasons, the simpler models did not perform well. Logistic regression proved to be computationally expensive with a large number of categorical variables, and decision trees performed poorly on test data.
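One way to respect the point about parcels with no previous sale record is to leave the holding period undefined rather than impute a value. The sketch below is a hedged illustration: the function name and arguments are hypothetical and do not correspond to actual WPRDC column names.

```python
from datetime import date
from typing import Optional


def years_held(prev_sale: Optional[date], sale: date) -> Optional[float]:
    """Holding period in years, or None when there is no earlier sale.

    None is kept distinct from a number because a missing previous sale
    may simply mean the property changed hands only once.
    """
    if prev_sale is None:
        return None
    return (sale - prev_sale).days / 365.25


# Invented examples.
print(years_held(date(2014, 1, 1), date(2016, 1, 1)))
print(years_held(None, date(2016, 1, 1)))
```

Parcels where the result is None cannot be labeled by the holding-period rule, so they would need to be set aside or handled separately rather than treated as short or long holds.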

A city like Pittsburgh would like to identify and avoid bad investor activity. However, in an effort to maintain housing affordability, the city cannot drive away potentially good investments that can develop and enrich its vibrant community. To maintain this balance, we propose an Assess and Address framework that gives actionable recommendations.

Assess

Confirm Results with In-Person Observation

Our analytical model can highlight and flag properties that are potentially at risk of being owned by flippers. City inspectors and community leaders can export a list of such at-risk targets, look up information on their owners or landlords, and monitor their behavior.

Address

Sanction Bad Practices

Install practices and systems that would discourage landlords from mistreating their tenants and violating requirements for minimum standards. Our suggestions include:

Minimum Property Standards established by the Allegheny County Health Department for rental properties.

Rental Registration – Force landlords to act responsibly towards tenants. A good example is the Probationary Rental Occupancy Permit (PROP) set by the city of Raleigh, NC, which aims to ensure better housing quality for tenants and discourages landlords from violating City Codes.

Foster Positive Practices

These are programs that would encourage good investors and community homeowners to invest in enriching and developing the city’s community spirit. Some suggestions and examples of their implementation include:

We hope to take this initiative forward with the help of SUDS at Carnegie Mellon University and present our findings to Pittsburgh’s city council.

Nick Kharas graduated from Carnegie Mellon University with a master’s degree concentrating in Data Analytics and Business Intelligence. Prior to his time at CMU, he was a business intelligence and data warehousing SME at a Japanese multinational financial holding company. When not a data buff, he enjoys travel, sport and meeting new people. Click here to check out his work on GitHub. You can also connect with Nick at https://www.linkedin.com/in/nickkharas.

Claire Jacquillat is an MBA candidate at Carnegie Mellon’s Tepper School of Business. She focuses her studies on operations management and operations research. She is the president of the Tepper Data Analytics Club, where she strives to foster an integrated use of business analytics in various industries. Before starting her MBA at the Tepper School, she worked as a strategist in Sales Enablement for a Fortune 500 company. You can connect with Claire at https://www.linkedin.com/in/clairejacquillat/en

Erin Yanacek is an MBA candidate at Carnegie Mellon’s Tepper School of Business. In summer 2017, Erin will join McKinsey as a Summer Associate. Prior to business school, Erin was a classical musician. She founded a nonprofit organization, the Chamber Orchestra of Pittsburgh, and toured internationally performing and teaching classical trumpet. You can connect with Erin at https://goo.gl/BEWz1L

Maksim Khaitovich is an MBA candidate at Carnegie Mellon’s Tepper School of Business. In summer 2017, Maksim will join A. T. Kearney as a Summer Data Science Associate. Prior to business school, Maksim worked as an engineer and IT consultant in fintech and wireless communications. You can connect with Maksim at https://www.linkedin.com/in/maksim-khaitovich-828a2b47/

The dataset and preliminary analysis are largely drawn from Albert Y. Kim and Adriana Escobedo-Land’s write-up in the Journal of Statistics Education.

The data consists of the public profiles of 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile. Using a Python script, data was scraped from users’ public profiles on June 30, 2012; any non-publicly facing information such as messaging was not accessible.

Furthermore, text responses to the 10 essay questions posed to all OkCupid users are included as well, such as “My Self Summary,” “The first thing people usually notice about me,” and “On a typical Friday night I am…” For a complete list of variables and more details, see the accompanying codebook.

Some questions:

How do the heights of male and female OkCupid users compare? What about ages?

What does the San Francisco online dating landscape look like? Or more specifically, what is the relationship between users’ sex and sexual orientation?

How accurately can we predict a user’s sex using their listed height?

Are there differences between the sexes in what words are used in the responses to the 10 essay questions?

What trends or relationships in the data can we generalize to the rest of the San Francisco population? To the wider population? For which analyses does the fact that the dataset came from OkCupid make it less generalizable? What about coming from San Francisco?

Mini Tutorial: Albert Y. Kim and Adriana Escobedo-Land’s article accompanying the dataset gives a walkthrough of how to do summary statistics, conditional probabilities, predictions, and text analysis in R using this dataset. Good stuff in there!