A living documentation of our data
If you see mistakes, have follow-up questions, or want to suggest changes, please create an issue on the source repository.
Due to potential sources of error in the data, we chose to use a robust regression [1] to model the 14-day linear trend. Robust linear models are well suited to this problem because they limit the influence of outliers in the data. For all of the 14-day trend lines shown in the report, we used the statsmodels [2] implementation of robust linear models with the Huber M-estimator [3]. The Huber loss is a squared error for inliers and an absolute error for outliers (up to a constant factor, and as specified by the threshold and scale parameters). We chose this function because it does not completely ignore outliers, as Tukey’s biweight does, but instead downweights their influence.
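```python
import numpy as np
import statsmodels.api as sm


def fit(X, y):
    # Regress y on the day index (0, 1, ..., len(X) - 1) plus an intercept.
    features = sm.add_constant(np.arange(len(X)))
    # Huber's T norm downweights outliers instead of discarding them.
    rlm_model = sm.RLM(y, features, M=sm.robust.norms.HuberT())
    model = rlm_model.fit()
    return model
```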
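To make the shape of the loss described above concrete, here is a minimal sketch of Huber's function and the weight it implies for a single residual. The function names are ours and purely illustrative, the residual `z` is assumed to be already divided by the scale estimate, and the threshold `t = 1.345` is assumed to match the default of statsmodels' `HuberT`. It is not the code used to produce the report.

```python
def huber_rho(z, t=1.345):
    """Huber loss for a scaled residual z with threshold t (illustrative)."""
    if abs(z) <= t:
        return 0.5 * z * z            # squared error for inliers
    return t * abs(z) - 0.5 * t * t   # absolute error (shifted) for outliers


def huber_weight(z, t=1.345):
    """Weight a scaled residual would receive in reweighted least squares."""
    if abs(z) <= t:
        return 1.0                    # inliers keep full weight
    return t / abs(z)                 # outliers are downweighted, never zeroed


for z in (0.5, 1.345, 3.0, 10.0):
    print(z, round(huber_rho(z), 4), round(huber_weight(z), 4))
```

The weight shrinks toward zero as the residual grows but never reaches it, which is the sense in which outliers are downweighted rather than ignored.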
We only create models for counties which have had at least 28 reported cases. For counties with low case counts, the fit of a linear trend may not be meaningful.
Due to a shortage of tests and a backlog of results, the total number of COVID-19 cases may be severely underestimated. However, because at-risk or suggestively symptomatic individuals are prioritized for testing, the percentage of positive tests may be an overestimate. Additionally, the biased sampling of who is tested may skew the demographics of those diagnosed.
Due to inherent uncertainty in RT-PCR [4] as well as serological antibody [5] tests, even multiple repeated tests for a positive patient can give an erroneous diagnosis [6]. For COVID-19, the most dangerous of these errors are false negatives [7], since these individuals may not receive required medical treatment and may continue to spread the virus.
In addition to the methodological sources of error above, reliable COVID-19 diagnoses are complicated by the temporal dynamics of the disease [8]. Not only can exogenous errors arise during sample collection (nose and throat swabs), especially if a location is understaffed for the number of samples it needs to collect, but the sensitivity of detection can also fluctuate with the time since infection and the severity of a case.
One of the earliest natural experiments for studying COVID-19, the Diamond Princess cruise ship quarantine in Japan [9, 10], showed that many individuals who are asymptomatic at the time of testing can in fact be positive and contagious. These silent spreaders can thus have an outsized effect on the spread of the virus and often evade testing.
In many of our metrics, we use statistical methods to control for the size of the Texas population. That adjustment is indicated by the phrase “Per Capita”, which means per 100,000 people wherever you see it mentioned. We encountered two major issues while determining how to adjust for population.
In the end, we opted to use the population estimate utilized and recommended by the U.S. Census Bureau’s own COVID-19 hub as the denominator when calculating any “Per Capita” rate. That population data from the 2018 American Community Survey 5-year series lives here. In the following equations, you will see this figure referenced as \(\text{Population}_\text{ACS_2018}\).
The total cases, deaths, and active cases metrics were reported “as is” from the Johns Hopkins University dataset. The Case Growth Rate chart was derived from the NYTimes time series data by dividing the current day’s reported case count for Texas by the previous day’s reported case count, subtracting 1, and converting the result to a percentage. The daily growth rate can be represented by the following equation:
\[\text{Case Growth Rate}=((\dfrac{cases_\text{today}}{cases_\text{yesterday}})-1)*100 \]
We also filtered to start the chart at March 6th, because prior to that date, there was hardly any data available to generate a case growth rate. Once we had established the case growth rates for each day, we attempted to generate trend lines using a rolling 7-day average of Case Growth Rates in Texas.
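As a rough sketch of the calculation described above, the snippet below computes the daily growth rate from a list of cumulative case counts and then smooths it with a 7-day average. The function names and the sample counts are made up for illustration, and we assume a trailing (rather than centered) window here; the report's own code may differ.

```python
def daily_growth_rates(cumulative_cases):
    """Percent change in cumulative cases from each day to the next."""
    rates = []
    for yesterday, today in zip(cumulative_cases, cumulative_cases[1:]):
        if yesterday == 0:
            rates.append(None)  # undefined when there were no prior cases
        else:
            rates.append((today / yesterday - 1) * 100)
    return rates


def rolling_mean(values, window=7):
    """Trailing mean; None until a full window of numeric values exists."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(sum(chunk) / window)
    return out


# Made-up cumulative counts, not real Texas data.
cases = [100, 110, 121, 130, 142, 150, 165, 180, 190]
rates = daily_growth_rates(cases)
print(rates)
print(rolling_mean(rates))
```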
| Metric | Equation |
|---|---|
| test_per_capita | \[\text{Test Per Capita}=(\dfrac{\text{COVID-19 Tests}_\text{tot}}{\text{Population}_\text{ACS_2018}})*\text{100,000} \] |
| daily_test_per_capita | \[\text{Daily Test Per Capita}=(\dfrac{\text{COVID-19 Tests}_\text{daily_tot}}{\text{Population}_\text{ACS_2018}})*\text{100,000} \] |
| daily_test_pos_rate | \[\text{Test Positive Rate}=(\dfrac{\text{+ COVID-19 Tests}_\text{daily_increase}}{\text{All COVID-19 Tests}_\text{daily_increase}})*100 \] |
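
The testing equations above can be sketched in a few lines. The function names, the population value, and the test counts below are placeholders chosen for illustration; in particular, the population is not the ACS 2018 figure.

```python
PER = 100_000  # "Per Capita" in this document means per 100,000 people


def per_capita(count, population):
    """Scale a raw count to a rate per 100,000 people."""
    return count * PER / population


def positive_rate(new_positive_tests, new_total_tests):
    """Share of the day's new tests that came back positive, in percent."""
    return 100 * new_positive_tests / new_total_tests


# Placeholder numbers only.
population = 1_000_000
print(per_capita(500, population))   # tests per 100,000 people
print(positive_rate(50, 1_000))      # percent of new tests that were positive
```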
All data visualized here is reported as is from the Texas Department of State Health Services (DSHS). No derived metrics were generated from this data for the charts seen here.
Hospital data shown here are derived metrics using hospital capacity data produced by the Texas Department of State Health Services (DSHS). They were calculated as follows:
| Metric | Equation |
|---|---|
| pct_hospitalized | \[\text{Pct. Hospitalized}=(\dfrac{\text{COVID-19}_\text{Hospitalized}}{\text{COVID-19}_\text{Active Cases}})*100 \] |
| gen_bed_avail_rate | \[\text{General Bed Availability}=(\dfrac{\text{General Beds}_\text{available}}{\text{General Beds}_\text{Total}})*100 \] |
| icu_bed_avail_rate | \[\text{ICU Bed Availability}=(\dfrac{\text{ICU Beds}_\text{available}}{\text{ICU Beds}_\text{Total}})*100 \] |
| vent_avail_rate | \[\text{Ventilator Availability}=(\dfrac{\text{Ventilators}_\text{available}}{\text{Ventilators}_\text{Total}})*100 \] |
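
A minimal sketch of how the four hospital metrics above might be computed from a single day's counts follows. The dictionary keys, the helper names, and the numbers are hypothetical and are not DSHS field names or figures; returning `None` for a zero denominator is our assumption about how to handle missing capacity data.

```python
def pct(numerator, denominator):
    """Return numerator as a percentage of denominator, or None when undefined."""
    if not denominator:
        return None
    return 100 * numerator / denominator


def hospital_metrics(snapshot):
    """Compute the four derived hospital metrics from one day's counts."""
    return {
        "pct_hospitalized": pct(snapshot["hospitalized"], snapshot["active_cases"]),
        "gen_bed_avail_rate": pct(snapshot["general_beds_available"], snapshot["general_beds_total"]),
        "icu_bed_avail_rate": pct(snapshot["icu_beds_available"], snapshot["icu_beds_total"]),
        "vent_avail_rate": pct(snapshot["ventilators_available"], snapshot["ventilators_total"]),
    }


# Made-up counts for illustration only.
example = {
    "hospitalized": 50, "active_cases": 1000,
    "general_beds_available": 200, "general_beds_total": 800,
    "icu_beds_available": 30, "icu_beds_total": 120,
    "ventilators_available": 10, "ventilators_total": 0,
}
print(hospital_metrics(example))
```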
The trends over time section charts the daily increase for each topic (cases, tests, and deaths) and also calculates a rolling 7-day average for each chart. As the charts indicate, the data is sourced from the New York Times time series dataset.
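For readers who want to reproduce this kind of chart, the sketch below turns a cumulative series into daily increases and averages them over 7 days. The function names and the sample totals are illustrative assumptions, and the average here only starts once a full 7-day window is available.

```python
def daily_increases(cumulative):
    """Difference consecutive cumulative totals to get per-day increases."""
    return [today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])]


def trailing_mean(values, window=7):
    """Mean of each full trailing window; output is shorter than the input."""
    return [
        sum(values[i - window + 1: i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Made-up cumulative totals, not New York Times data.
totals = [0, 2, 5, 9, 14, 20, 27, 35, 44]
increases = daily_increases(totals)
print(increases)
print(trailing_mean(increases))
```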
All data visualized here is reported as is from the Bureau of Labor Statistics and Homebase. No derived metrics were generated from this data for the charts seen here.
Homebase is an incredible group that has provided their data for the benefit of the small businesses they serve. Their dataset is derived from over 5,000 small businesses in Texas that use their services. We mention this because that context is critical to drawing meaningful takeaways from their data. First, by “small business”, we mean that Homebase mostly serves businesses of 100 employees or fewer, which makes its data an incredibly valuable asset for understanding small business dynamics. Second, Homebase’s metrics shown here are marked as “estimates” because they do not reflect the entire universe of small businesses in Texas, even though they have a sufficient sample size to represent small business dynamics in Texas.
| Metric | Description |
|---|---|
| Est. Local Businesses Open | This represents the change in the number of businesses open compared to the beginning of January, which is what Homebase uses as their benchmark to calculate the change figures. The number below shows the change relative to a week ago. |
| Est. Reduction In Hours Worked | This represents the change in the number of hours worked by hourly employees compared to the beginning of January, which is what Homebase uses as their benchmark to calculate the change figures. The number below shows the change relative to a week ago. |
| Est. Hourly Employees Working | This represents the change in the number of hourly employees working compared to the beginning of January, which is what Homebase uses as their benchmark to calculate the change figures. The number below shows the change relative to a week ago. |
Coming Soon.
1. Robust Statistics, Peter J. Huber. John Wiley and Sons, Inc. 1981.
2. Statsmodels: Econometric and Statistical Modeling with Python, Skipper Seabold and Josef Perktold. Proceedings of the 9th Python in Science Conference. 2010.
3. Robust Estimation of a Location Parameter, Peter J. Huber. Annals of Mathematical Statistics. 1964.
4. Stability Issues of RT‐PCR Testing of SARS‐CoV‐2 for Hospitalized Patients Clinically Diagnosed with COVID‐19, Li, Yafang, Lin Yao, Jiawei Li, Lei Chen, Yiyan Song, Zhifang Cai, and Chunhua Yang. Journal of Medical Virology. March 26, 2020.
5. Test performance evaluation of SARS-CoV-2 serological assays, Whitman, Jeffrey D., Joseph Hiatt, Cody T. Mowery, Brian R. Shy, Ruby Yu, Tori N. Yamamoto, Ujjwal Rathore et al. medRxiv. April 29, 2020.
6. False‐negative of RT‐PCR and prolonged nucleic acid conversion in COVID‐19: Rather than recurrence, Xiao, Ai Tang, Yi Xin Tong, and Sheng Zhang. Journal of Medical Virology. April 9, 2020.
7. A case report of COVID-19 with false negative RT-PCR test: necessity of chest CT, Feng, Hao, Yujian Liu, Minli Lv, and Jianquan Zhong. Japanese Journal of Radiology. April 7, 2020.
8. Temporal dynamics in viral shedding and transmissibility of COVID-19, He, Xi, Eric HY Lau, Peng Wu, Xilong Deng, Jian Wang, Xinxin Hao, Yiu Chung Lau et al. Nature Medicine. April 15, 2020.
9. Chronology of COVID-19 Cases on the Diamond Princess Cruise Ship and Ethical Considerations: A Report From Japan, Nakazawa, Eisuke, Hiroyasu Ino, and Akira Akabayashi. Disaster Medicine and Public Health Preparedness. March 24, 2020.
10. Public health responses to COVID-19 outbreaks on cruise ships—worldwide, February–March 2020, Moriarty LF, Plucinski MM, Marston BJ, et al. MMWR Morbidity and Mortality Weekly Report. March 27, 2020.
11. In this September 2019 brief, the state demographer’s projections suggest definitive growth in Texas, regardless of the migration scenario, including zero net migration.
12. Even though the most raw numbers usually synced, the figures reflected were usually reported on a lag. For example, if the state publishes new numbers on the evening of May 1st, which is intended to reflect data for May 1st, then, on May 2nd, the numbers for May 1st get published in their datasets, even if the state went ahead and published those numbers on their own website.
13. This github issue has a clear explanation https://github.com/CSSEGISandData/COVID-19/issues/2185