My Capstone Project: Exploring Toronto and Searching for the Best Place to Establish an Indian Restaurant

My First Attempt at a Data Science Project

Shaunak Varudandi

Published in Towards Data Science on Jan 1, 2020 (9 min read)

Introduction

Hello, esteemed audience. This is my first time writing a blog post and putting it out for the public to read and scrutinize. I am writing it as part of my IBM Data Science Capstone course, where I was assigned the task of formulating a hypothetical problem and solving it with data science techniques to produce meaningful insights and conclusive results. As the title suggests, the task here is to explore the beautiful city of Toronto, traverse its neighborhoods, and suggest a few neighborhoods that have business potential for opening a new Indian restaurant.

Target Audience

While formulating the problem statement, I wanted a problem that individuals could actually face in the real world. Hence, I came up with a problem statement that aims to find the best possible location for a proposed Indian restaurant in the city of Toronto.

This serves two different audiences. First, it helps individuals who aim to start a new business in the hospitality sector (i.e., restaurants) find a place with the lowest concentration of Indian restaurants. Second, it can help tourists choose places (boroughs) based on their personal preferences, for example a borough with a good selection of restaurants or a borough that is home to a considerable number of parks.

Data Set

The data set required for this project was acquired from three different data sources, listed below.

A Wikipedia page, to fetch the boroughs and neighborhoods of Toronto.
A .csv file, to fetch the latitude and longitude corresponding to each postal code.
The Foursquare API, to fetch the public venues in the vicinity of each neighborhood.

The Wikipedia page contains a table of the postal codes used in Toronto, along with the corresponding boroughs and neighborhoods. The .csv file provides the latitude and longitude coordinates of each postal code in the Toronto region. These coordinates are then used in tandem with the Foursquare API to retrieve a list of popular venues in each neighborhood.

The data is comprehensive and yielded valuable insights about Toronto that eventually helped us reach conclusive results and observations. As acquired at the start of the project, however, the data was unclean and required intensive pre-processing to convert it into a working set suitable for the machine learning algorithms and visualization operations applied to it.

In table form, the data looks something like this before it enters the pre-processing phase of the project.

Once the acquired data has been run through the pre-processing steps, it looks like the table given below.

Data Pre-processing

The first step was to scrape the Wikipedia page listing all the boroughs and neighborhoods along with their postal codes. I converted the scraped table into a data frame, since data frames are a convenient data structure for analysis and visualization. The data frame still contained many values that can be treated as missing, since some postal codes were not assigned to any borough or neighborhood. Missing values can cause discrepancies in the results at the later stages of the project, so I removed every row that contained one.
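The cleaning step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the sample rows, the column names, and the assumption that unassigned postal codes are marked with the string "Not assigned" are all hypothetical.

```python
import pandas as pd

# Hypothetical sample standing in for the scraped Wikipedia table.
raw = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods",
                      "Victoria Village", "Harbourfront"],
})


def drop_unassigned(df, placeholder="Not assigned"):
    """Treat the placeholder as a missing value and drop every row containing one."""
    # mask() blanks out the cells equal to the placeholder; dropna() then
    # removes any row with at least one blanked cell.
    cleaned = df.mask(df.eq(placeholder)).dropna()
    return cleaned.reset_index(drop=True)


clean = drop_unassigned(raw)
print(clean)
```

Dropping a row when any column is missing mirrors the description above; a gentler alternative would be to keep rows where only the neighborhood is missing.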

The second step was to import a .csv file containing the latitude and longitude coordinates of each postal code into a data frame, for ease of analysis in the later stages. I then merged the data frame holding the borough and neighborhood information with the data frame holding the coordinates. The merge was performed on the postal code column, which was then dropped from the final table since it was of no use for further analysis.
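A merge of this kind might look like the sketch below. The column names ("Postcode", "Postal Code") and the coordinate values are illustrative assumptions, not the contents of the real files.

```python
import pandas as pd

# Hypothetical cleaned borough table.
boroughs = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront"],
})

# Hypothetical coordinate table, as it might be loaded with pd.read_csv().
# The rows are deliberately in a different order to show that the merge
# matches on the key, not on position. Coordinates are illustrative only.
coords = pd.DataFrame({
    "Postal Code": ["M5A", "M3A", "M4A"],
    "Latitude": [43.65, 43.75, 43.73],
    "Longitude": [-79.36, -79.33, -79.32],
})

# Join the two tables on the postal code, then drop the key columns,
# which are not needed for the later analysis.
merged = boroughs.merge(coords, left_on="Postcode", right_on="Postal Code", how="inner")
final = merged.drop(columns=["Postcode", "Postal Code"])
print(final)
```

An inner join keeps only postal codes present in both tables, so any code without coordinates silently disappears; comparing `len(final)` with `len(boroughs)` is a quick way to check for that.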

Data Analysis

The data analysis phase consisted of two significant tasks, one for each aspect of our problem statement:

Borough analysis.
Finding the best possible neighborhood for establishing an Indian restaurant in Toronto, based on the number of Indian restaurants in the vicinity of the chosen spot (i.e., choosing a neighborhood with minimum competition).

First, I started with the borough analysis. To get data on the venues in a particular borough, I used the Foursquare API. The API was linked to my code by passing the client ID, the client secret, and the API version, after which I could call it for venue information on any borough in Toronto.

Since the project is based on borough-wise analysis, I split the final, clean data frame into separate data sets, each containing data for only one borough. This was done by retaining only the rows associated with the borough of interest. I then wrote a function that calls the Foursquare API, retrieves data such as venue name, venue category, venue latitude, and venue longitude, and combines it with the borough table extracted in the earlier step. I also dropped the borough column, since it was not necessary for this analysis, and finally grouped the venue categories by the number of times each appeared in the data set. A sample data set of grouped venue categories is shown below.

Data set with venue count (*Image by author*)
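The API call and the grouping step could be sketched as below. This is not the function from the project. It assumes the legacy Foursquare v2 "explore" endpoint and its usual response layout, and the credentials, version string, and sample response are placeholders. No network request is made here; the URL would be fetched with an HTTP client and the decoded JSON passed to `parse_venues`.

```python
from collections import Counter
from urllib.parse import urlencode

# Legacy Foursquare v2 "explore" endpoint (an assumption; the post does not name it).
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the request URL for venues around one latitude/longitude pair."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,           # API version as a YYYYMMDD date string
        "ll": f"{lat},{lng}",
        "radius": radius,       # search radius in meters
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


def parse_venues(payload, neighborhood):
    """Flatten one decoded JSON response into rows of venue details."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or [{}]
        rows.append({
            "Neighborhood": neighborhood,
            "Venue": venue["name"],
            "Venue Category": categories[0].get("name", "Unknown"),
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
        })
    return rows


# Hypothetical response, trimmed to the fields used above.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe One", "location": {"lat": 43.77, "lng": -79.25},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Cafe Two", "location": {"lat": 43.78, "lng": -79.26},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Green Space", "location": {"lat": 43.79, "lng": -79.27},
               "categories": [{"name": "Park"}]}},
]}]}}

url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "20180605", 43.77, -79.25)
rows = parse_venues(sample_payload, "Example Neighborhood")

# Group the venue categories by how often each one appears.
category_counts = Counter(row["Venue Category"] for row in rows)
print(category_counts.most_common())
```

Keeping URL construction separate from response parsing makes it easy to change the radius between analyses and to test the parsing without calling the API.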

The data frame produced by this analysis is then plotted as a bar chart for easy readability. A bar chart for Scarborough is shown in the image below.

Bar chart with venue count for Scarborough (*Image by author*)

As the next step of my capstone project, I analyzed the whole city of Toronto with the aim of finding the neighborhoods and boroughs best suited for establishing an Indian restaurant. The process was similar to the one followed for the borough analysis. We used the same function to call the Foursquare API, but instead of passing a data frame segregated by individual borough, we passed the data frame containing all the boroughs and neighborhoods in Toronto. This gave us a large data frame with information about almost all venues in Toronto, along with their categories. We then extracted the rows for Indian restaurants and discarded the remaining entries, since they were of no use to us. The result was a concise data set containing only the information important to our project. Finally, we created a new data frame holding the count of Indian restaurants in each borough. The data frame is depicted in the image below.

Count of Indian restaurants in each borough (*Image by author*)
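The filtering and counting described above could be done along these lines. The rows, borough names, and column names are made up for illustration, and the category label "Indian Restaurant" is assumed to match the one returned by the API.

```python
import pandas as pd

# Made-up venue rows standing in for the city-wide Foursquare results.
venues = pd.DataFrame({
    "Borough": ["Borough A", "Borough A", "Borough B", "Borough C", "Borough C"],
    "Venue": ["Venue 1", "Venue 2", "Venue 3", "Venue 4", "Venue 5"],
    "Venue Category": ["Indian Restaurant", "Indian Restaurant",
                       "Indian Restaurant", "Coffee Shop", "Park"],
})

# Keep only the Indian restaurants and discard every other venue.
indian = venues[venues["Venue Category"] == "Indian Restaurant"]

# Count the remaining rows per borough, largest count first.
borough_counts = (
    indian.groupby("Borough")
    .size()
    .reset_index(name="Indian Restaurants")
    .sort_values("Indian Restaurants", ascending=False)
    .reset_index(drop=True)
)
print(borough_counts)
```

Note that a borough with no Indian restaurants (Borough C here) drops out of the result entirely, which matters when looking for the places with the least competition.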

This data is then plotted on a bar chart for ease of understanding and to catch the reader's eye.

A bar chart depicting the count of Indian restaurants in each borough (*Image by author*)

I also plotted the locations of Indian restaurants on a map of Toronto. The map has a marker at the position of each restaurant, showing that an Indian restaurant is present in that neighborhood. When a person clicks on a marker, a label pops up with the number of Indian restaurants in that neighborhood. The map therefore gives viewers a good idea of where Indian restaurants are located and, at the same time, how many there are in a particular neighborhood. A person who plans to open an Indian restaurant will want to avoid places that already have a large number of Indian restaurants and look for places with minimum competition, which can be done by studying the map carefully. Given below is the map that portrays the locations of Indian restaurants in Toronto.

Map of Toronto portraying the location of Indian restaurants in Toronto city (*Image by author*)
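The data behind such markers can be prepared as in the sketch below, which builds one marker per restaurant and a popup label carrying the neighborhood's restaurant count. The rows are hypothetical, and the rendering itself is left out because the post does not name the mapping library used.

```python
from collections import Counter

# Hypothetical Indian-restaurant rows with their neighborhood and coordinates.
restaurants = [
    {"Neighborhood": "Neighborhood A", "Venue Latitude": 43.670, "Venue Longitude": -79.320},
    {"Neighborhood": "Neighborhood A", "Venue Latitude": 43.672, "Venue Longitude": -79.322},
    {"Neighborhood": "Neighborhood B", "Venue Latitude": 43.700, "Venue Longitude": -79.340},
]


def build_markers(rows):
    """Return one marker per restaurant, labeled with its neighborhood's count."""
    per_neighborhood = Counter(row["Neighborhood"] for row in rows)
    markers = []
    for row in rows:
        name = row["Neighborhood"]
        markers.append({
            "location": (row["Venue Latitude"], row["Venue Longitude"]),
            "popup": f"{name}: {per_neighborhood[name]} Indian restaurant(s)",
        })
    return markers


markers = build_markers(restaurants)
for marker in markers:
    print(marker["location"], marker["popup"])
```

Each dictionary maps directly onto a marker in a mapping library. With folium, for example, it could be passed as `folium.Marker(location=..., popup=...)`, though that choice of library is an assumption.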

Results

The table presented below gives readers a brief overview of the venue categories that can be found in each borough. This information was retrieved from the data returned by the Foursquare API.

Overall Analysis of Boroughs in Toronto City (*Image by author*)

Readers who are interested in a detailed borough analysis, the bar charts created for each borough, or the map portraying the locations of Indian restaurants in Toronto can go through my GitHub repo.

Assumptions and Considerations

A few assumptions and considerations were made during the course of this project. They are listed below and should be kept in mind while going through the implementation steps.

During the borough analysis phase, the radius passed in the Foursquare API call was 500 meters.
During the second phase of the project, finding an optimal place for setting up an Indian restaurant, the radius passed in the Foursquare API call was 700 meters. The change was necessary because a smaller radius missed some neighborhoods.
The results returned by the Foursquare API can vary from day to day, so the number of venues returned may change slightly.
While determining the best location for setting up an Indian restaurant, only one factor was taken into consideration: avoiding neighborhoods that already house a considerable number of Indian restaurants (i.e., searching for locations that pose minimum competition).
Two boroughs, Mississauga and Queen’s Park, were eliminated from our analysis since the search for venues in them did not return enough results to work with.
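The single selection criterion above, minimum competition, amounts to ranking boroughs by their count of Indian restaurants in ascending order. A minimal sketch with placeholder borough names and counts (none of them taken from the project's results):

```python
import pandas as pd

# Placeholder list of every borough under consideration.
all_boroughs = ["Borough A", "Borough B", "Borough C", "Borough D"]

# Placeholder counts; boroughs with no Indian restaurants are absent,
# as they would be after a filter followed by a group-by.
indian_counts = pd.Series({"Borough A": 5, "Borough C": 2}, name="Indian Restaurants")

# Restore the missing boroughs with a count of zero, then sort so the
# boroughs with the least competition come first.
ranked = indian_counts.reindex(all_boroughs, fill_value=0).sort_values(kind="mergesort")
print(ranked)
```

A count of zero can mean either genuinely low competition or simply that the API returned few venues for that borough, so the ranking is best read alongside the total number of venues found there.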

Conclusion

During the course of this capstone project, I was able to apply the data science techniques and tools that I learned in the IBM Data Science course. This helped me unearth meaningful insights from my analysis of the Toronto data set. The findings from the data analysis phase are listed below.

Borough Analysis

Coffee shops occur very frequently in almost all the boroughs.
Mississauga and Queen’s Park have very few venues for a tourist to choose from.
Parks are the next most frequently occurring venue across the boroughs.
Downtown Toronto has the largest and most varied choice of venues for a tourist.

Toronto City Analysis (for establishing a new Indian restaurant)

East Toronto and Central Toronto are the two boroughs with the maximum number of Indian restaurants.
India Bazaar, Thorncliffe Park, and The Beaches West are the neighborhoods with the maximum number of Indian restaurants.
North York, Queen’s Park, and Mississauga are ideal locations for opening a new Indian restaurant, based on our criterion that the new business should face minimum competition.