

Covid-19: Anticipating the advance of the pandemic using Google – Francisco Gallego (translation)

Article written by Jaime Casassus, associate professor at the UC Economics Institute; Francisco Gallego, associate professor at the UC Economics Institute and Scientific Director of J-PAL LAC; and Rodrigo Icaran, Master's student at the UC Economics Institute. Link: Original article.

One of the major challenges in dealing with the pandemic is identifying new infections in time. On June 12, for example, the front page of the newspaper El Mercurio carried the headline “Four out of ten patients with Coronavirus did not receive the result of their PCR examination while they were contagious”. This is extremely worrying. On the one hand, for patients, a result delivered with such a delay becomes little more than an anecdote. On the other hand, if the most important information for decision-making arrives with such a delay, it is very unlikely, if not impossible, that the necessary measures will be taken in time to slow the progress of the disease. In other words, policymakers could be making today the decisions that were appropriate 14 days ago. In this column we briefly suggest new, albeit preliminary, methods for obtaining information on the evolution of contagions in our country with less delay.

A recent study by Harvard Medical School scholars suggests that the disease may already have been circulating during 2019[1]. The authors reach this conclusion mainly by observing that during the second half of 2019, in some cities in Asia, there was an explosive increase, compared with previous years, in Internet searches for symptoms related to the disease: fever, cough and diarrhea, among others. With this motivation, and considering the delay in the delivery of PCR test results in our country, we set out to study whether a similar methodology could be useful for monitoring the evolution of new cases in Chile[2].

Intuitively, a person who has just begun to experience the symptoms of Coronavirus should be more likely to search Google for terms such as “test”, “PCR”, “smell” and other keywords related both to the symptoms of the disease and to the demand for tests. Consequently, if more people are being infected at any given time, the number of related searches should increase. This relationship appears to hold in practice. Figure 1 shows the evolution of the number of Google searches for the terms mentioned above in the Santiago Metropolitan Area. The original data are daily, but for this analysis we use a 7-day moving average to remove variation due to weekends. Although the terms clearly differ in popularity, their searches are strongly correlated.
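As a minimal sketch of the smoothing step, assuming the daily search counts are available as a plain array (the values below are made up for illustration and are not Google Trends data), a trailing 7-day average can be computed as follows:

```python
import numpy as np

def trailing_mean(daily, window=7):
    """Average of each day and the (window - 1) days before it."""
    daily = np.asarray(daily, dtype=float)
    kernel = np.ones(window) / window
    # mode="valid" keeps only the days that have a full window behind them
    return np.convolve(daily, kernel, mode="valid")

# Hypothetical search interest with a weekend dip, repeated for two weeks
searches = [50, 52, 51, 53, 52, 30, 28] * 2
smoothed = trailing_mean(searches)
```

Because every 7-day window contains exactly one of each weekday, the weekend dip in this toy series disappears completely from the smoothed values.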

Based on the data above, we constructed an index called the “GTCovid Index”, which is a linear combination of the Google searches for the terms considered in this analysis. The weights of the index correspond to the first principal component of the standardized data[3]. Figure 2 shows this index for the Santiago Metropolitan Area together with its original components. The index shows a significant increase in the popularity of these terms during the first half of May, then a smaller increase until the beginning of the second week of June, followed by a sustained drop from that date on.
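The index construction described above, a weighted sum of standardized search series with weights taken from the first principal component, might look roughly like this. It is a sketch under stated assumptions: the function name `first_pc_index`, the sign convention, and the synthetic three-term data are illustrative and do not come from the article.

```python
import numpy as np

def first_pc_index(searches):
    """searches: array of shape (days, terms). Returns (index, weights)."""
    x = np.asarray(searches, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0)     # standardize each term
    cov = np.cov(z, rowvar=False)                # covariance of standardized data
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    w = eigvecs[:, -1]                           # direction of largest variance
    if w.sum() < 0:                              # the sign of a component is arbitrary;
        w = -w                                   # assumed convention: weights sum > 0
    return z @ w, w

# Synthetic data (hypothetical): three terms sharing one wave of interest
rng = np.random.default_rng(0)
common = np.sin(np.linspace(0, 3, 60))
terms = np.column_stack([
    100 * common + rng.normal(0, 5, 60),   # stands in for a popular term
    40 * common + rng.normal(0, 5, 60),    # a less popular term
    10 * common + rng.normal(0, 1, 60),    # a rare term
])
index, weights = first_pc_index(terms)
```

Standardizing first means that a rare term and a popular term contribute on the same scale, so the weights reflect how closely each term moves with the others rather than how often it is searched.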

Since the terms considered are related to Coronavirus symptoms, it is natural to ask whether the index has any short-term predictive power over new confirmed cases, which in turn would tell us whether the number of infected people in the region is rising. While such a prediction is of little use to the people running the searches, who may already have the virus, having this information in time would help in taking measures to avoid further contagion in the rest of the population. The predictability analysis could also be applied to other variables, such as the positivity rate controlling for changes in the number of tests performed, visits to emergency services, use of hospital ICU beds, or the number of deaths from COVID-19. In this context, however, it seems more relevant to try to predict the number of new infections, since they are the first link in the chain of subsequent events mentioned above.

Figure 3 shows the evolution of the GTCovid Index and of the number of new PCR-confirmed infections for Santiago, both as moving averages over the last 7 days of standardized data. As is well known, the number of new cases rose considerably from the second week of May, continued to grow at a slower rate between the last week of May and mid-June, and finally declined from that date on. The similarity between the two curves is evident, as is the lead of approximately one week with which the Google search index anticipates the surge in cases. The high correlation and the lag between the two time series suggest that the number of Google searches could provide a reasonable approximation of the progression of the disease in the population, with a much shorter delay than PCR tests.
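One way to put a number on a lead that the figure shows visually is to correlate the index with cases shifted forward by 0, 1, 2, ... days and pick the shift with the highest correlation. The sketch below assumes two equally long daily series; the function name `best_lead` and the bell-shaped synthetic data are illustrative, and the article itself reads the lead off the figure rather than reporting this calculation.

```python
import numpy as np

def best_lead(index, cases, max_lead=14):
    """Return (lead, correlation) for the shift where index[t] best matches cases[t + lead]."""
    index = np.asarray(index, dtype=float)
    cases = np.asarray(cases, dtype=float)
    scores = []
    for lead in range(max_lead + 1):
        # Pair index[t] with cases[t + lead]
        scores.append(np.corrcoef(index[: len(index) - lead], cases[lead:])[0, 1])
    lead = int(np.argmax(scores))
    return lead, float(scores[lead])

# Synthetic example: cases trace the same wave as the index, 7 days later
t = np.arange(90)
index = np.exp(-((t - 40) / 12.0) ** 2)
cases = np.exp(-((t - 47) / 12.0) ** 2)
lead, corr = best_lead(index, cases)
```

On this constructed example the function recovers the 7-day shift that was built into the data; with real series the peak of the correlation profile is usually flatter and should be read as approximate.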

The conceptual link between the probability of COVID-19 infection and the interest in learning about its symptoms (for example, loss of smell) and about how to confirm it (for example, through a PCR test) reduces the possibility that our results are a simple coincidence. By contrast, when we use terms related to the pandemic but not directly to the symptoms of the virus, such as “coronavirus”, “COVID”, or even “quarantine”, we cannot generate an indicator that correctly anticipates the number of new infections. This helps us rule out that the observed correlation is spurious.

To complement the above and validate the procedure, we carried out two statistical exercises to formally assess the predictive capacity of the GTCovid Index and measure its ability to forecast increases in new COVID-19 cases. The first is a Granger causality test, which strongly rejects the hypothesis that our index does not Granger-cause the number of new cases in Santiago. In the reverse direction, we cannot reject the hypothesis that new cases do not Granger-cause the Google index.
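For readers unfamiliar with the test, a Granger test asks whether past values of one series improve a forecast of another beyond what the second series' own past already provides. The sketch below computes the usual F statistic by comparing two least-squares regressions. The article does not report the number of lags or the software used, so `lags=2`, the helper names, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def lagged(series, lags):
    """Columns series[t-1], ..., series[t-lags] for t = lags .. len(series) - 1."""
    n = len(series)
    return np.column_stack([series[lags - k : n - k] for k in range(1, lags + 1)])

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def granger_f(x, y, lags=2):
    """F statistic for H0: lags of x do not help predict y given y's own lags."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[lags:]
    const = np.ones((len(target), 1))
    restricted = np.hstack([const, lagged(y, lags)])
    unrestricted = np.hstack([restricted, lagged(x, lags)])
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / dof), dof

# Synthetic data (hypothetical): y follows x with a one-step delay, not vice versa
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = np.zeros(n)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=n - 1)

f_xy, dof = granger_f(x, y)   # does x help predict y?
f_yx, _ = granger_f(y, x)     # does y help predict x?
```

A large F statistic leads to rejecting the null of no Granger causality; a p-value can be obtained from the F distribution with (lags, dof) degrees of freedom, for example with `scipy.stats.f.sf`. In the synthetic example the statistic is large in the x-to-y direction only, which is the asymmetric pattern the article reports for the index and new cases.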

For the second exercise, which measures the predictive capacity of the index, we considered a linear model that we call the GTCovid model. Its dependent variable is the average of new contagions over the last 7 days, and its regressors are lags of 7 and 14 days of that same contagion variable and the GTCovid Index. This model is compared with one that uses only the 7- and 14-day lags of new contagions to predict new cases for the current week; we call this autoregressive model “AR(2)”. Because what matters is the models’ out-of-sample predictive capacity, each model is estimated on a 6-week moving window and then used to predict the following week. For each subsequent week, the model is re-estimated on a new window covering the previous 6 weeks.
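The rolling procedure can be sketched as follows. Several details are assumptions, since the article does not spell them out: daily data, ordinary least squares with an intercept, the index entering with the same 7- and 14-day lags as cases, and a forecast made once per week for the day 7 days ahead. The synthetic series are invented so that the index genuinely leads cases by a week.

```python
import numpy as np

WINDOW = 42    # 6 weeks of daily observations in each estimation sample
HORIZON = 7    # one week ahead; equal to the shortest lag, so regressors are known

def design(y, g, t, use_index):
    """Regressor rows for target days t: 1, y[t-7], y[t-14] (+ g[t-7], g[t-14])."""
    cols = [np.ones(len(t)), y[t - 7], y[t - 14]]
    if use_index:
        cols += [g[t - 7], g[t - 14]]
    return np.column_stack(cols)

def rolling_forecasts(y, g, use_index):
    """Re-estimate on the latest 6 weeks, predict one week ahead, move on a week."""
    y = np.asarray(y, dtype=float)
    g = np.asarray(g, dtype=float)
    targets, preds = [], []
    first_origin = 14 + WINDOW - 1     # 2 weeks for lags + 6 weeks for the sample
    for origin in range(first_origin, len(y) - HORIZON, HORIZON):
        train = np.arange(origin - WINDOW + 1, origin + 1)
        beta, *_ = np.linalg.lstsq(design(y, g, train, use_index), y[train], rcond=None)
        target = np.array([origin + HORIZON])
        preds.append((design(y, g, target, use_index) @ beta)[0])
        targets.append(origin + HORIZON)
    return np.array(targets), np.array(preds)

# Synthetic data (not the article's): cases follow the index with a 7-day delay
rng = np.random.default_rng(1)
n = 140
g = np.cumsum(rng.normal(size=n))
y = np.empty(n)
y[7:] = g[:-7] + 0.1 * rng.normal(size=n - 7)
y[:7] = y[7]

days, pred_gt = rolling_forecasts(y, g, use_index=True)    # GTCovid-style model
_, pred_ar = rolling_forecasts(y, g, use_index=False)      # AR(2)-style benchmark
rmse_gt = np.sqrt(np.mean((pred_gt - y[days]) ** 2))
rmse_ar = np.sqrt(np.mean((pred_ar - y[days]) ** 2))
```

Each forecast uses only information available at the forecast date, which is what makes the comparison out-of-sample. The first forecast needs 8 weeks of history, matching the 6 weeks of sample plus 2 weeks of lags that the procedure requires.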

Figure 4 shows the predictions of the GTCovid model and the AR(2) model, together with the curve of new cases. The first prediction is for mid-May because the procedure needs 6 weeks for the estimation sample plus 2 more weeks for the lags. Both models predict the sharp increase in new cases in May, but only the model that includes Google searches quickly anticipated the change in trend at the end of that month; the AR(2) model needed at least one more week to correct course. Something similar happened with the decline in cases that began in mid-June: the GTCovid model correctly anticipated the fall in new cases, while the AR(2) model took at least a week longer to make the correction.

Figure 5 plots the weekly predictions of each model against the actual new Coronavirus cases. The GTCovid model's predictions stay close to the 45-degree line, which indicates a good fit. The AR(2) model's predictions are more scattered and lie mainly below the 45-degree line, which suggests some positive bias for this particular period. Including information from Google searches reduces the root-mean-square error (RMSE) of the out-of-sample predictions from 0.43 to 0.15 and their mean absolute error (MAE) from 0.35 to 0.12.
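For reference, the two error measures can be computed as in the short sketch below. The `actual` and `predicted` values are invented for illustration; only the final two lines use figures reported in the article.

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-square error: penalizes large misses more heavily."""
    err = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss."""
    err = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(err)))

# Hypothetical values, only to show the calculation
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.0, 4.5]
example_rmse = rmse(actual, predicted)
example_mae = mae(actual, predicted)

# Relative reductions implied by the figures reported in the article
rmse_reduction = 1 - 0.15 / 0.43
mae_reduction = 1 - 0.12 / 0.35
```

With the reported figures, both measures fall by roughly two thirds when the Google search index is added to the model.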

Although we believe that, in principle, methods such as those described above can serve as a secondary input for monitoring the progress of the pandemic and, eventually, for anticipating a possible outbreak of the virus, they should clearly be interpreted with caution[4]. Factors such as highly uneven internet access, among others, could lead to wrong conclusions. For example, the virus could begin to spread strongly in areas without internet access, which would obviously not be detectable through Google searches. It may also be necessary to identify other behaviors or searches that help predict an outbreak of the disease (for example, searches related to ways of evading social distancing measures). In any case, exercises such as this one show the importance of using unconventional data in public policy, an area in which there still seems to be a long way to go[5].

[1] Nsoesie, Elaine Okanyene, Benjamin Rader, Yiyao L. Barnoon, Lauren Goodwin, and John S. Brownstein. “Analysis of hospital traffic and search engine data in Wuhan China indicates early disease activity in the Fall of 2019” (2020).

[2] Other scientific and press articles have considered this topic in other countries. For example: Dukic, Vanja, et al. “Tracking Epidemics With Google Flu Trends Data and a State-Space SEIR Model”. Journal of the American Statistical Association (2012), and Stephens-Davidowitz in an opinion column in The New York Times (2020).

[3] This methodology has already been applied to prediction problems in other areas, in Chile and in other countries. See Yan Carrière-Swallow and Felipe Labbé, “Nowcasting with Google Trends in an Emerging Market”. Journal of Forecasting (2013).

[4] Note that internet search indicators have been used in various social science disciplines to identify phenomena, behaviors and trends in real time.

[5] Along these lines, the work carried out in several countries using wastewater analysis can be mentioned. See Ampuero et al. (2020), “SARS-CoV-2 Detection in Sewage in Santiago, Chile - Preliminary results”.