Are there biases in Yelp ratings for restaurants in areas of different socioeconomic status?

Decision context

Our analysis is meant to inform how the Office of Economic Development (OED) decides to allocate its public funding. With the information we have, we hope to help shift funding allocations toward improving quality of life in underfunded communities. We chose to base our decision context around OED city officials in order to raise awareness of economic disparity among less-funded communities across different census tracts. This would give business owners an opportunity to be more successful without the social impact of Yelp bias.

Yelp Restaurants in Seattle

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=47.630305,-122.333184&zoom=11&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false

The map above shows the locations of all the restaurants in the Seattle area that we used for our project. In the future, we will plot each restaurant in a color determined by its current rating. In total, 1852 restaurants from 157 different categories were evaluated and used for this project.

This graph plots the distribution of restaurants grouped by rating, from 1 to 5 in steps of 0.5. The x-axis shows the restaurant rating, with 1 being the lowest and 5 the highest. The y-axis shows the percentage of restaurants at each rating. The percentage for each rating level was calculated by dividing the number of restaurants with that rating by the total number of Seattle restaurants in the dataset.
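The percentage calculation described above can be sketched as follows. This is a minimal illustration rather than the code used for the report; the function name `rating_percentages` and the sample ratings are hypothetical.

```python
from collections import Counter


def rating_percentages(ratings):
    """Return {rating: percentage of restaurants with that rating}, sorted by rating."""
    counts = Counter(ratings)
    total = len(ratings)
    # Frequency of each rating divided by the total number of restaurants.
    return {rating: 100.0 * counts[rating] / total for rating in sorted(counts)}


# Hypothetical ratings on the 0.5-step scale; not taken from the dataset.
sample_ratings = [4.0, 4.0, 3.5, 4.5, 3.0, 4.0, 2.5, 3.5]
percentages = rating_percentages(sample_ratings)

for rating, pct in percentages.items():
    print(f"{rating}: {pct:.1f}%")
```

Because each percentage uses the same denominator, the values across all rating levels sum to 100.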

The line graph suggests that ratings for Seattle restaurants are approximately normally distributed. Out of all the restaurants in the Seattle area, a rating of 4 is the most common, and more than 70% of the restaurants have a rating higher than 3.

The violin plot above shows the distribution of median household income at each rating. The distribution of median household income is moderately skewed for restaurants with ratings of 1.5, 3, and 4. For these ratings, the mean of median household income is below the median, meaning the distribution is skewed to the left. In addition, the median of the median household income increases as the rating increases.

We divided the restaurants into three categories (high, mid, and low SES) based on median household income. Note that the split was not based on Seattle household income. High-SES restaurants are those in the top 33% by median household income, low-SES restaurants are those in the bottom 33%, and mid-SES restaurants fall between the high and low cutoffs. This graph covers the high-SES group, which was determined to be a median household income of at least $74,559. This group contains 54 unique census tracts.

The restaurants in this group are in census tracts with a median household income between $59,275 and $74,559. We found 28 unique census tracts in this group. The x-axis shows the unique census tracts in the dataset, and the y-axis shows the number of restaurants in each census tract. Of the three SES groups, this one contains the fewest census tracts, which may mean that its restaurants are spread over a smaller area or that few census tracts fall within the middle socioeconomic status.

The restaurants in this group are in census tracts with a median household income lower than $59,275. We found 39 unique census tracts in this group. The x-axis shows the unique census tracts in the dataset, and the y-axis shows the number of restaurants in each census tract. The number of restaurants is spread unevenly across the census tracts within the lower household income bracket. This may indicate a larger, densely populated census tract or a more commercialized area in which to open a restaurant business.
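As an illustration of the grouping described in this section, the sketch below assigns an SES group from a census tract's median household income and counts restaurants per tract within each group. The cutoffs are the ones reported above. The function names, the handling of values exactly on a cutoff, and the sample records are assumptions.

```python
# Cutoffs reported in this section (median household income, USD).
LOW_CUTOFF = 59_275
HIGH_CUTOFF = 74_559


def ses_group(median_household_income):
    """Map a tract's median household income to 'low', 'mid', or 'high'."""
    # Assumption: an income exactly on a cutoff goes to the higher group.
    if median_household_income >= HIGH_CUTOFF:
        return "high"
    if median_household_income < LOW_CUTOFF:
        return "low"
    return "mid"


def restaurants_per_tract(restaurants):
    """Count restaurants per census tract within each SES group.

    `restaurants` is an iterable of (tract_id, median_household_income) pairs.
    """
    counts = {"low": {}, "mid": {}, "high": {}}
    for tract_id, income in restaurants:
        group = ses_group(income)
        counts[group][tract_id] = counts[group].get(tract_id, 0) + 1
    return counts


# Hypothetical records; tract IDs and incomes are illustrative only.
sample = [("tract_A", 48_000), ("tract_A", 48_000),
          ("tract_B", 62_500), ("tract_C", 91_000)]
tract_counts = restaurants_per_tract(sample)
print(tract_counts)
```

The number of keys in each inner dictionary corresponds to the count of unique census tracts in that SES group.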

The graph above plots the distribution of restaurants by price level, grouped by high, mid, and low SES. The x-axis shows the price level (number of $ signs), with the lowest being 1 dollar sign and the highest being 4 dollar signs. The y-axis shows the number of restaurants at each price level.

This graph shows that Middle SES and Low SES areas have many cheap restaurants ($), with Low SES areas having the most and High SES areas the fewest. For medium-priced restaurants ($$), Middle SES areas have the most and Low SES areas have the second most. This price category also has the largest number of restaurants overall. Moderately expensive restaurants ($$$) are most prevalent in Middle SES areas and less prevalent in High SES areas, with Low SES areas having the fewest. For the most expensive restaurants ($$$$), Middle SES areas have the most, High SES areas have the second most, and Low SES areas have the fewest. This category also has the fewest restaurants overall. Based on the data, the majority of restaurants fall in the $ and $$ price categories.

Features

The dataset provides a multitude of features and factors relating to restaurant quality. Every Yelp restaurant in our dataset has the features below, which we can analyze:

Rating (numeric, in steps of 0.5): the restaurant’s average rating given by Yelp community users

reviewCount (integer): the restaurant’s total number of reviews given by Yelp community users

recentHealthViolationType (string): the type of violation (red or blue) the restaurant had during the most recent restaurant inspection. Red refers to high-risk factors: improper practices or procedures identified as the most prevalent contributing factors of foodborne illness or injury. Blue refers to low-risk factors: preventive measures to control the addition of pathogens, chemicals, and physical objects into foods.

recentHealthInspectionScore (integer): the restaurant’s total violation points given by a King County food inspector during the most recent restaurant inspection. The inspection score is the cumulative total of red and blue violation points. Each violation within the red high-risk factors and blue low-risk factors has an associated number of points. The health inspection score is the sum of violation points during a single restaurant inspection visit.

totalInspectionScore (integer): the sum of all historical health inspection scores given to the restaurant by King County food inspectors.

totalInspectionCount (integer): the total number of inspection visits at the restaurant

avgInspectionSore (integer): the overall average inspection score of the restaurant. Formula: totalInspectionScore / totalInspectionCount

restaurantTotalMonths (integer): the estimated total number of months the restaurant has been open. The number is calculated from the date difference between the most recent and earliest inspection dates.

restaurantMaxSeats (integer): the maximum number of seats recorded by the King County food inspector.

recentHealthInspectionResult (string): the inspection condition (Complete, Incomplete, Not Ready for Inspection, Unsatisfactory, or Satisfactory) given at the end of the most recent food inspection visit by a King County food inspector

recentHealthInspectionGrade (integer): the restaurant’s food safety rating given by a King County food inspector. (1 - Excellent, 2 - Good, 3 - Okay, 4 - Needs to Improve)
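The two derived features in the list above (the average inspection score and the estimated months open) could be computed roughly as follows. This is a sketch under assumptions: the report does not say how the date difference was converted to months, so whole calendar months are used here, and the function names and sample values are hypothetical.

```python
from datetime import date


def average_inspection_score(total_score, inspection_count):
    """totalInspectionScore / totalInspectionCount, or None if never inspected."""
    if inspection_count == 0:
        return None
    return total_score / inspection_count


def months_between(earliest, most_recent):
    """Whole calendar months between two inspection dates (assumed convention)."""
    return (most_recent.year - earliest.year) * 12 + (most_recent.month - earliest.month)


# Hypothetical values; not taken from the King County inspection data.
avg_score = average_inspection_score(30, 4)
total_months = months_between(date(2015, 1, 15), date(2016, 4, 1))
print(avg_score, total_months)
```

Note that a restaurant with a single inspection would get 0 months under this convention, which is one reason the feature is only an estimate of how long a restaurant has been open.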

To further investigate Yelp restaurant ratings in various socio-economic neighborhoods, we trained a regression model using the features above. In our regression model, we looked at which predictor variables (features) are important to the outcome variable (rating). Prior to training a regression model, we evaluated the above features to see which ones have a strong correlation with restaurant rating.

recentHealthViolationType, recentHealthInspectionResult, and recentHealthInspectionGrade are categorical variables, which we converted into dummy variables for modeling.
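A minimal sketch of dummy-variable conversion for one categorical feature is shown below. It is not the code used in the report. The helper `dummy_encode` is hypothetical, and dropping the first level as the reference category is an assumption that mirrors common regression practice.

```python
def dummy_encode(values, prefix, drop_first=True):
    """Turn a list of category labels into rows of 0/1 indicator columns."""
    levels = sorted(set(values))
    # Dropping one level avoids perfectly collinear columns in a regression.
    kept = levels[1:] if drop_first else levels
    return [
        {f"{prefix}_{level}": int(value == level) for level in kept}
        for value in values
    ]


# Hypothetical violation types for four restaurants.
violation_types = ["red", "blue", "blue", "red"]
encoded = dummy_encode(violation_types, "recentHealthViolationType")
for row in encoded:
    print(row)
```

With two levels (red and blue), a single indicator column is enough: a 0 in `recentHealthViolationType_red` implies a blue violation.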

Features Ranked by Importance for Yelp Restaurants Rating in high Socio-economic area

Features Ranked by Importance for Yelp Restaurants Rating in low Socio-economic area

SVM Regression Model for Yelp Restaurants

To analyze Yelp restaurant ratings, we trained two regression models: one on restaurants in high socio-economic neighborhoods and one on restaurants in low socio-economic neighborhoods. For prediction, however, we used the contrasting socio-economic data; that is, we fed low socio-economic restaurants into the high socio-economic model and vice versa. Next, we built a residual graph for each model. By looking at the residuals (actual rating minus predicted rating), we can evaluate how a set of restaurants would perform under the regression model of a different socio-economic group. If a majority of the residuals lie above the 0 line in a residual graph, the model underestimates the ratings of the predicted set of restaurants. If a majority of the residuals lie below the 0 line, the model overestimates them.
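The over- and underestimation proportions can be sketched as below, assuming a residual is the actual rating minus the predicted rating. The function name and the sample ratings are hypothetical, and the predictions are hard-coded rather than produced by a trained model.

```python
def residual_proportions(actual, predicted):
    """Return (overestimated, underestimated) proportions of residuals.

    A negative residual (actual < predicted) counts as overestimated;
    a positive residual (actual > predicted) counts as underestimated.
    """
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    overestimated = sum(1 for r in residuals if r < 0) / n
    underestimated = sum(1 for r in residuals if r > 0) / n
    return overestimated, underestimated


# Hypothetical ratings: actual values from one SES group, predictions
# from a model trained on the contrasting group.
actual_ratings = [4.0, 3.5, 3.0, 4.5, 2.0]
predicted_ratings = [3.5, 3.8, 2.5, 4.0, 2.5]

over, under = residual_proportions(actual_ratings, predicted_ratings)
print("Overestimated residuals proportion: ", over)
print("Underestimated residuals proportion: ", under)
```

Residuals of exactly zero are counted in neither group here, so the two proportions sum to 1 only when no prediction is exact.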

High Socio-Economic Restaurants

## Overestimated residuals proportion:  0.5044248
## Underestimated residuals proportion:  0.4955752

The residual graph for the high socio-economic model shows a nearly even split among residuals. Approximately 50% of the predicted restaurants had negative residuals and 50% had positive residuals. Based on the residual analysis, the regression model for restaurant ratings in high socio-economic neighborhoods appears unbiased.

Low Socio-Economic Restaurants

## Overestimated residuals proportion:  0.4385965
## Underestimated residuals proportion:  0.5614035

The residual graph for the low socio-economic model shows an unequal split among residuals. Approximately 44% of the predicted restaurants had negative residuals and 56% had positive residuals. Based on the residual analysis, the regression model for restaurant ratings in low socio-economic neighborhoods appears biased. The residuals suggest that 56% of the ratings of high socio-economic restaurants were underestimated by the low socio-economic regression model.

Linear Regression Model for Subway Restaurants

To further investigate the results, we ran the same regression models for Subway restaurants in high and low socio-economic neighborhoods and plotted the residuals. Below are the results:

## Overestimated residuals proportion:  0.2
## Underestimated residuals proportion:  0.8

## Overestimated residuals proportion:  0
## Underestimated residuals proportion:  1

Conclusion

For our analysis of 1852 restaurants, we matched each restaurant’s location to its census tract to find the median household income for that location. From there, we split the data into low, mid, and high socioeconomic status (SES). Based on the regression models trained on our data, the predictive model for high SES areas neither overestimates nor underestimates the actual data, while the predictive model for low SES areas underestimates it. These results indicate that the low SES model underestimates approximately 56% of the actual restaurant ratings it predicts and overestimates approximately 44%. The difference in proportions is not substantial enough to consider the result significant. Furthermore, the additional residual analysis of the Subway restaurant models in high and low socioeconomic areas showed a large proportion of underestimated residuals; however, due to the limited sample size, we could not conclude that the result was significant. Therefore, we cannot reject our null hypothesis that restaurant ratings are not biased by socioeconomic status, because our overestimation and underestimation proportions do not differ significantly.
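One way to put a number on whether an over/under split differs from an even split is an exact two-sided sign test. It assumes that an unbiased model would produce positive and negative residuals with equal probability and that the residuals are independent. The report does not state that this test was used; the sketch below is illustrative, and the counts are hypothetical.

```python
from math import comb


def sign_test_p_value(n_under, n_over):
    """Exact two-sided p-value for H0: P(underestimated) = 0.5."""
    n = n_under + n_over
    k = max(n_under, n_over)
    # Upper-tail probability P(X >= k) for X ~ Binomial(n, 0.5).
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The null distribution is symmetric, so double the tail and cap at 1.
    return min(1.0, 2 * upper_tail)


# Hypothetical counts, chosen only to show the effect of a small sample.
p_small = sign_test_p_value(n_under=4, n_over=1)
p_even = sign_test_p_value(n_under=10, n_over=10)
print(p_small, p_even)
```

With only five residuals, even a 4-to-1 split gives a p-value of 0.375, which illustrates why a lopsided proportion from a very small sample cannot be treated as significant.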

The analysis may not provide conclusive evidence that there is bias in restaurant ratings based on socioeconomic status; however, it provides a foundation for future research on restaurant ratings. There is an array of factors that contribute to a restaurant’s quality, such as location proximity, food freshness, taste, and many more that we have not included in the analysis. In addition, there is other information that we can pull from the City of Seattle, such as neighborhood crime rate/index, median house rental price, number of individual household members, average age of a neighborhood, and more. In future research, we hope to include those additional factors and features to enhance our regression model and findings.