The Analytics Paradigm — Steps to Arriving at the Most Truthful Answer
The whole point of this is: What are the steps that you will take to get to your most truthful answer?
Imagine you’re at a party, and someone asks, “Do Americans prefer Italian or Mexican food?” Instead of just guessing, you decide to find out with data—because that’s what data analysts do.
Suddenly, your friend jokes, “Why don’t we just ask the first nine people we see and call it a day?” Well, as funny as that sounds, it’s not too far from what you might actually do in a small-scale study.
In the world of analytics, every question is a chance to uncover hidden truths in data. As a data analyst, your job is more than just number-crunching; it’s about guiding a journey from uncertainty to understanding.
So, for instance, how would you tackle the challenge of figuring out whether Americans prefer Italian or Mexican food? The answer itself matters less than the process: what steps will you take to get to your most truthful answer?
Now, before diving in, take 3–5 minutes to reflect on how you would approach this. Think about what data to collect, which relationships to look for, and how to make your answer as accurate as possible. Doing this thinking up front will help you get a more trustworthy answer, even if all you do is ask those nine people.
I’m sure some of you had interesting ideas. Now, let me show you how you would approach this to arrive at the most truthful answer.
First things first: take the question and analyze it, a process I call decomposing the ask. What does it really mean to ask, “What food do Americans prefer, Italian or Mexican?” There are several aspects to break down.
First, when we talk about food, are we discussing takeout only, restaurant dining, or home-cooked meals? The term “food” is too broad and general to tackle without narrowing it down.
Next, we need to define “Americans.” Without getting political, we need to specify who we’re talking about. Are we considering people on the East Coast, the West Coast, or maybe those in New Jersey? Is it all Americans or just a sample? We have to clearly identify our target group.
Then, we look at the types of food. What do we mean by Italian or Mexican food? Is it food that originated in Italy or Mexico, or food that Italians or Mexicans eat in America? What about fusion cuisine, like Italian-Asian fusion? Does that count? We must be clear on these definitions.
We also need to understand what we mean by “prefer.” Does preference mean the food people eat most frequently, or the food they would choose if cost weren’t an issue? Is preference different across demographics? Since preference is subjective, how do we make it more objective?
Additionally, Italian and Mexican foods are broad categories. Is pizza considered Italian? Is Taco Bell truly Mexican? Are we talking about fast food or traditional cuisine? Does making pasta at home count as Italian, or are we only talking about restaurant food?
Finally, we must consider the quality of the dining experience. Are we comparing the best versions of these foods or just average experiences?
Step one is always to decompose the ask. By breaking down the question into specific, clear components, we turn a subjective question into an objective one. This way, we set the parameters and narrowly define the ask, ensuring clarity and focus in our analysis.
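One way to keep these decisions honest is to write them down before touching any data. The sketch below records one possible scoping of the question as a small Python structure. Every field name and every value in it is a hypothetical choice made for illustration; a different analyst could defend different ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AskDefinition:
    """One explicit answer to each ambiguity in the original question."""
    population: str            # who counts as "Americans" for this study
    setting: str               # takeout, restaurant dining, or home cooking
    cuisines: tuple            # the categories being compared
    preference_metrics: tuple  # what stands in for "prefer"
    includes_fast_food: bool   # does Taco Bell count?


# Hypothetical scoping choices, not the only defensible ones.
ask = AskDefinition(
    population="restaurant reviewers in sampled U.S. zip codes",
    setting="restaurant dining",
    cuisines=("italian", "mexican"),
    preference_metrics=("average rating", "review count"),
    includes_fast_food=True,
)

for field_name, value in vars(ask).items():
    print(f"{field_name}: {value}")
```

Writing the scope down like this makes it obvious later which version of the question the data actually answers.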
The second step is to identify the data sources. Where are we going to get all of this data? Google Maps is a great option because it has robust APIs, but it can be costly. Are there any free alternatives? Can we use platforms like Yelp or TripAdvisor to gather the data we need?
Considering there are on the order of a million restaurants in the U.S., we have to ask ourselves if we need data from all of them. Probably not. At this point, the goal is to pinpoint which data sources will be most useful.
Yelp, for instance, is a dynamic data source with a wealth of information that we need. Therefore, we might decide to streamline our efforts and focus on using Yelp to gather the data.
We craft our plan and decide what we need to measure. We need a clear strategy to guide our analysis. What data will we gather, and how will we interpret it? Imagine diving into Yelp reviews, focusing on the number of reviews and average ratings.
These metrics aren’t just cold, hard numbers; they represent people’s real preferences and experiences with Italian and Mexican cuisine. By identifying patterns and trends, we’ll uncover the most truthful answers to our questions.
This process is about more than data — it’s about understanding the real stories and preferences hidden within the numbers.
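To make the two metrics concrete, here is a minimal sketch that computes an average rating and a total review count per cuisine. The records are made up for illustration, and the field names (`cuisine`, `rating`, `review_count`) are assumptions about how the collected data might be laid out.

```python
from statistics import mean

# Made-up records for illustration only; real rows would come from Yelp.
restaurants = [
    {"cuisine": "italian", "rating": 4.0, "review_count": 120},
    {"cuisine": "italian", "rating": 3.5, "review_count": 80},
    {"cuisine": "mexican", "rating": 4.5, "review_count": 60},
    {"cuisine": "mexican", "rating": 3.5, "review_count": 100},
]


def summarize(rows, cuisine):
    """Return the average rating and total review count for one cuisine."""
    chosen = [r for r in rows if r["cuisine"] == cuisine]
    return {
        "average_rating": mean(r["rating"] for r in chosen),
        "total_reviews": sum(r["review_count"] for r in chosen),
    }


for cuisine in ("italian", "mexican"):
    print(cuisine, summarize(restaurants, cuisine))
```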
Next, we dive into the data collection process. Yelp has a vast amount of data, which we can access either through web scraping or through its API. Web scraping involves extracting data directly from web pages, while an API provides a structured way to request and retrieve data. We’ll learn more about these methods in the coming posts.
When we use an API, we’re essentially tapping into a controlled gateway provided by major websites and data providers. These APIs, whether free or paid, come with documentation that guides us on how to retrieve data in a structured manner. It’s like having a key to access valuable information without overwhelming the provider’s system.
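As a rough picture of what "using the key" looks like, the sketch below builds a search request for one zip code and one cuisine without sending it. The endpoint URL, the `location`, `categories`, and `limit` parameters, and the bearer-token header are assumptions based on Yelp's public Fusion API documentation, so check them against the current docs before relying on them. The zip code and `YOUR_API_KEY` are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint, taken from Yelp's Fusion API docs; verify before use.
SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_request(zip_code, category, api_key, limit=20):
    """Build (but do not send) a search request for one zip code and cuisine."""
    query = urlencode({"location": zip_code, "categories": category, "limit": limit})
    return Request(
        f"{SEARCH_URL}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


request = build_search_request("07030", "italian", api_key="YOUR_API_KEY")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it. Keeping the building step separate makes it easy to inspect the URL first.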
Once we understand how to use these APIs, we can start building our data retrieval plan. This involves identifying the tools and technologies we’ll use, such as Python for coding. We’ll also gather supplementary data from sources like the Census Bureau, whose data API can be queried from Python through wrapper libraries. By merging this data with our Yelp information, we create a comprehensive dataset ready for analysis. This plan ensures that we have all the necessary data to answer our questions accurately.
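The merge itself can be sketched in a few lines. The example below attaches census fields to restaurant rows by zip code. All values are made up, the zip codes are placeholders, and `population` is a hypothetical stand-in for whatever census fields turn out to be useful.

```python
# Made-up values for illustration; neither table contains real data.
yelp_rows = [
    {"zip_code": "11111", "cuisine": "italian", "rating": 4.0, "review_count": 120},
    {"zip_code": "11111", "cuisine": "mexican", "rating": 3.5, "review_count": 90},
    {"zip_code": "22222", "cuisine": "italian", "rating": 3.5, "review_count": 200},
]
census_by_zip = {
    "11111": {"population": 50000},
    "22222": {"population": 25000},
}


def merge_on_zip(rows, lookup):
    """Attach the census fields for each row's zip code (a left join)."""
    return [{**row, **lookup.get(row["zip_code"], {})} for row in rows]


merged = merge_on_zip(yelp_rows, census_by_zip)
print(merged[0])
```

Rows whose zip code is missing from the census table are kept unchanged, so no restaurant data is silently dropped.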
Once we start retrieving the data, we’ll follow a step-by-step process. We’ll begin by selecting a random zip code, then writing Python code to create a request and save the data. We’ll repeat this process multiple times until we have a complete dataset.
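Here is a minimal sketch of that loop: sample zip codes at random, fetch each cuisine, and save the rows as CSV. The `fake_fetch` function is a hypothetical stand-in that returns made-up data, so the example runs without network access or an API key. In a real run it would be replaced by a function that calls the API.

```python
import csv
import io
import random


def collect(zip_codes, cuisines, fetch, n_samples, seed=0):
    """Sample zip codes at random and gather rows for each cuisine."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rows = []
    for zip_code in rng.sample(zip_codes, n_samples):
        for cuisine in cuisines:
            for business in fetch(zip_code, cuisine):
                rows.append({"zip_code": zip_code, "cuisine": cuisine, **business})
    return rows


def fake_fetch(zip_code, cuisine):
    """Hypothetical stand-in for a real API call; returns made-up data."""
    return [{"name": f"{cuisine} place", "rating": 4.0, "review_count": 10}]


rows = collect(["11111", "22222", "33333"], ["italian", "mexican"], fake_fetch, n_samples=2)

# Written to an in-memory buffer here; use open("yelp.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```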
If this sounds overwhelming, don’t worry. For those familiar with coding, this will be straightforward. For beginners, with a bit of patience and practice, even the complex parts will soon make sense.
We’ll also learn how to process the data we retrieve. This involves understanding the URLs and the data they return. By the end of this, you’ll be well-equipped to handle and analyze the data confidently.
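As a small preview of processing what comes back, the sketch below parses a JSON response and keeps only the fields the analysis needs. The response text is trimmed and made up, and the field names (`businesses`, `name`, `rating`, `review_count`) are assumed from Yelp's documentation rather than confirmed here.

```python
import json

# A trimmed, made-up response; the field names are assumed from Yelp's docs.
raw_response = """
{
  "businesses": [
    {"name": "Example Trattoria", "rating": 4.5, "review_count": 210},
    {"name": "Example Taqueria", "rating": 4.0, "review_count": 180}
  ],
  "total": 2
}
"""


def parse_businesses(text):
    """Keep only the fields the analysis needs from a search response."""
    payload = json.loads(text)
    return [
        {
            "name": b["name"],
            "rating": b["rating"],
            "review_count": b["review_count"],
        }
        for b in payload.get("businesses", [])
    ]


for row in parse_businesses(raw_response):
    print(row)
```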
Once we have collected all the necessary data, our next step is to assemble and clean it. We have multiple data sources: one for each city and each type of restaurant, capturing the top 20 restaurants in each category along with their ratings and number of reviews.
Our goal is to consolidate all this information into a single comprehensive dataset, akin to a large Excel spreadsheet.
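A minimal sketch of that consolidation step follows. The per-city tables are made up, and the single cleaning rule shown (dropping rows with no rating) is an assumption chosen for illustration. Real cleaning would likely involve more checks.

```python
# Made-up per-city, per-cuisine tables standing in for the collected files.
sources = {
    ("City A", "italian"): [{"name": "Example Trattoria", "rating": 4.0, "review_count": 120}],
    ("City A", "mexican"): [{"name": "Example Taqueria", "rating": 4.5, "review_count": 95}],
    ("City B", "italian"): [{"name": "Example Osteria", "rating": None, "review_count": 40}],
}


def consolidate(tables):
    """Stack every table into one list, tagging rows and dropping unrated ones."""
    combined = []
    for (city, cuisine), rows in tables.items():
        for row in rows:
            if row["rating"] is None:  # simple cleaning rule: skip missing ratings
                continue
            combined.append({"city": city, "cuisine": cuisine, **row})
    return combined


dataset = consolidate(sources)
print(len(dataset), "rows")
```

Each row now carries its own city and cuisine, which is what lets the whole thing live in one flat, spreadsheet-like table.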
Next, let’s analyze our final dataset for trends.
We’ll notice that the average ratings usually stay around 3.8. We’ll look at the review counts, like 476,000 versus 573,000, and see how often Mexican restaurants do better than Italian ones in the same area, and vice versa.
These numbers help us understand what people prefer. A basic bar graph might not show all the details, since both cuisines average around 3.8 in ratings, but a closer look shows Italian food gets more reviews than Mexican food, even with similar ratings.
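The area-by-area comparison can be sketched as a simple tally of which cuisine has more reviews in each area. The totals below are made up and the zip codes are placeholders, so the result says nothing about the real data.

```python
# Made-up review totals per area; real values would come from the dataset.
review_totals = {
    "11111": {"italian": 1200, "mexican": 900},
    "22222": {"italian": 700, "mexican": 1100},
    "33333": {"italian": 1500, "mexican": 1400},
}


def count_wins(totals):
    """Count the areas where each cuisine has more reviews than the other."""
    wins = {"italian": 0, "mexican": 0, "tie": 0}
    for counts in totals.values():
        if counts["italian"] > counts["mexican"]:
            wins["italian"] += 1
        elif counts["mexican"] > counts["italian"]:
            wins["mexican"] += 1
        else:
            wins["tie"] += 1
    return wins


print(count_wins(review_totals))
```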
We’ll run a statistical test (a t-test) to check whether these differences are statistically significant. From our first look, it seems people don’t strongly prefer one cuisine over the other, as both have similar ratings and broadly similar review counts.
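For a feel of what the test computes, the sketch below calculates a t statistic by hand on made-up ratings. It uses Welch's variant, which does not assume the two groups have equal variances. That choice is an assumption on my part, since the analysis here only specifies "a t-test". Turning the statistic into a p-value needs the t distribution, which a library call such as `scipy.stats.ttest_ind(a, b, equal_var=False)` handles.

```python
from math import sqrt
from statistics import mean, variance

# Made-up ratings for illustration only.
italian = [4.0, 3.5, 4.5, 3.5, 4.0]
mexican = [4.0, 4.0, 3.5, 3.0, 4.5]


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t_stat, df = welch_t(italian, mexican)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

A t statistic close to zero means the gap between the two averages is small relative to the spread in the data, which is what "no significant difference" looks like in practice.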
As any good scientist would, we must acknowledge the limitations of our study. And there are quite a few to consider here.
- Firstly, the demographics of Yelp users may not fully represent the broader American population.
- Secondly, dining experiences at restaurants may not accurately reflect preferences for homemade meals. We focused on those who use restaurants, but what about those who prefer cooking at home? This group remains unaccounted for in our data.
- Additionally, there’s the fine dining bias to consider. Italian restaurants, for instance, may include more upscale establishments, potentially influencing ratings and averages in favor of their cuisine over others.
These factors remind us of the complexities involved in drawing definitive conclusions from our analysis.
Even with these limitations, we still have to make a call. Our goal is to reach the conclusion that best reflects the truth. Based on our analysis, Americans’ preferences for Italian and Mexican food appear to be quite similar overall.
Both cuisines receive statistically similar ratings, averaging around 3.8 with a p-value suggesting no significant difference. However, there’s noteworthy evidence that Italian restaurants tend to garner more reviews than Mexican ones, with a p-value of approximately 0.05, which sits right at the conventional significance threshold.
This suggests either a heightened interest in dining at Italian establishments or a tendency among Yelp users to review them more frequently.
Does this fully answer the question?
Not entirely. Yet, it represents our best effort to approach an objective answer using the available data.
LET’S BE REAL CONCLUSION
Then, to sum up, realistically: on ratings it’s a tie, and the review counts tilt slightly toward Italian, because higher review counts typically indicate that more patrons visited. That is a reasonable conclusion. If I were a betting person in Vegas, this is where my money would go.
CONCLUSION
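Okay, to wrap everything up neatly, this is our approach: decompose the ask, identify the data sources, plan what to measure, collect the data, assemble and clean it, analyze it for trends, acknowledge the limitations, and reach a conclusion.
This is the essence of the analytics paradigm: constructing a narrative that uncovers the truth and leads us to an objective conclusion.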