How we come up with a data science, analytics, or data-based research project is the subject of both common lore and debate, or at least debate as expressed through dissenting behavior. Common lore dictates that we start with the business question we care most about, regardless of available data and methodologies, then identify the data and methods needed to answer it, and then execute the project. In practice, this is almost never how it happens.
In the real world, we operate under constraints. If it really were the case that we could do anything we wanted, we would be flying around in magic cars and I would be doing more wonderful things than writing this blog post. In my world, data availability and data quality are the key constraints. Just like we can’t fly around in magic cars, we frequently can’t answer our most pressing business questions using our available data.
In fact, the most important inventions and discoveries are rarely the result of a top-down approach. When humans first learned to make tools from bronze or iron, it wasn't because a bunch of MBAs gathered around a conference table in front of a PowerPoint and told the engineers to go look at those rocks. Instead, people experimented hands-on with the materials in their environment, by trial and error, until they made tools that solved their day-to-day tasks.
Similarly, the most novel data-based projects are generally created by analysts exploring the data that is actually available and imagining what problems it could solve. There is no doubt that the top-down approach can hone and redirect existing capabilities to better serve organizational goals. But the top-down approach is generally better at reinventing the wheel than at inventing it in the first place.
Regardless, especially when only so much data is available, it's often more efficient to start with the possible than the hopeful. Data constraints are especially acute for people working on personal projects with public data. Valuable datasets are rarely released to the public, because those who create them naturally want to capitalize on that value themselves. Up-to-date datasets are generally the valuable ones, so the information that is publicly available tends to be months or even years out of date, for several reasons:
- The organizations that create the data only disseminate it to the public once it is out of date and significant revenue can no longer be generated from it.
- The organization wants its own analysts to pore over the data, extracting any value that's there, before anyone else can.
- Many public datasets are produced by the government, and the government has its own associated researchers who seek to benefit from the data before anyone else does.
- Public datasets produced through the government usually follow dated ETL processes that span chains of organizations, and the data takes a very long time to complete that pipeline.
This is the situation I faced when I sought out a project that would give me an opportunity to work with time series data while touching on subjects I'm personally interested in outside of work. I wanted to build my time series skills because I've mostly worked with cross-sectional and panel data in the past, and time series are comparatively difficult to work with. I'm also interested in outdoor recreation and issues facing rural areas. At work I analyze whatever subjects other people assign me, so my personal project was going to be something I care about.
I was drawn to the Google Trends dataset because you can analyze a virtually unlimited set of search terms and the data is nearly current. Since the value of time series analysis is closely tied to forecasting the future, out-of-date data is especially useless here. Google has an incentive to make this data available because it sells advertising slots for search terms, but the data is, thankfully, not limited to advertisers. I started querying it for a wide range of terms related to outdoor recreation, rural areas, my occupation, and a number of other topics.
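To make this concrete, here is a minimal sketch of how search-interest series can be pulled programmatically, assuming the unofficial pytrends library; the search terms are hypothetical examples, not the exact terms I queried.

```python
# Pull weekly search-interest series from Google Trends via the
# unofficial pytrends library (pip install pytrends).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)

# Hypothetical example terms; Google Trends allows up to five per payload.
terms = ["Traverse City", "Mackinac Island", "Pictured Rocks"]
pytrends.build_payload(kw_list=terms, timeframe="today 5-y", geo="US")

# Returns a DataFrame indexed by date, one column per term, plus an
# isPartial flag marking incomplete periods.
interest = pytrends.interest_over_time().drop(columns="isPartial")
print(interest.tail())
```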
I started noticing interesting things about the search interest for towns and places I knew. Specifically, the data was far more seasonal than I expected. There were also interesting trends in which places were gaining interest and which weren't. I realized that search interest for places closely tied to tourism might be closely tied to the vitality of those places as well. In fact, I have since found that aggregate search interest for Northern Michigan tracks tourism spending closely.
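A quick way to check that kind of seasonality is to decompose a series into trend, seasonal, and residual components. The sketch below continues from the pytrends example above and assumes weekly data, so one seasonal cycle is roughly 52 periods; the statsmodels decomposition is one standard tool for this, not necessarily the one I used.

```python
# Decompose a weekly search-interest series into trend, seasonal, and
# residual components; `interest` is the DataFrame from the sketch above.
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

series = interest["Traverse City"]
decomposition = seasonal_decompose(series, model="additive", period=52)
decomposition.plot()
plt.show()
```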
These observations from basic exploratory analysis were the beginning of the project. From that point, the project took many turns I couldn't have predicted. What started as a plan to implement an ARIMA model instead resulted in a basic machine learning model, with the data heavily transformed based on relationships I observed. For instance, the trends of most of these series, and how they relate to weather variables, differ between seasons, which is hard to capture. My current model still has shortcomings I'm working to correct.
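As an illustration of the kind of transformation I mean, the sketch below interacts a weather variable with season indicators so a model can learn season-dependent effects. The weather column is a labeled placeholder, and the random forest is an illustrative model choice, not the model described in this post.

```python
# Season-aware feature engineering: let the effect of temperature on
# search interest differ by season via interaction features.
# Continues from the pytrends sketch; `interest` has a DatetimeIndex.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = interest[["Traverse City"]].rename(columns={"Traverse City": "interest"})
season_map = {12: "winter", 1: "winter", 2: "winter",
              3: "spring", 4: "spring", 5: "spring",
              6: "summer", 7: "summer", 8: "summer",
              9: "fall", 10: "fall", 11: "fall"}
df["season"] = df.index.month.map(season_map)

# Placeholder weather series; in practice this would be joined from a
# real weather dataset on date.
df["temp"] = np.random.default_rng(0).normal(10, 12, len(df))

# One dummy column per season, plus temperature interacted with each,
# so the model can fit a different temperature response per season.
features = pd.get_dummies(df["season"], prefix="season")
for col in list(features.columns):
    features[f"temp_x_{col}"] = df["temp"] * features[col]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features, df["interest"])
```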
The rest of the story is found in the various installments of this project that I shared on LinkedIn and this webpage, or that are available upon request.