How do we balance making data-driven decisions in government against our goals of limiting data collection and protecting resident privacy?
The City & County of San Francisco provides numerous services to its residents — everything from library cards to housing vouchers to public health services. Key to providing the right service to the right person is identifying who that person is. This is easier said than done. To ease the burden on residents, and to make sure City services are available to all regardless of immigration or documentation status, the City doesn’t routinely collect unique identifiers like Social Security numbers or driver’s license numbers. What we do collect across most services are people’s names, dates of birth, and sometimes contact and demographic information.
This can cause issues when identifying people across multiple systems. For example, one person’s information may be recorded across many different City systems as:
1. Jane Doe, 1/12/1990
2. Jayne Doe, 1/12/1990
3. Jayne Doe, 12/1/1990
4. Jane G Doe, NA
5. JG Doe, 1/12/1990
Linking these records based on an exact match (i.e. “Jane Doe” spelled exactly the same in each instance) would lead to the City treating the records above as separate individuals even though human review would indicate these are probably the same person.
Using only an exact match approach, we cannot easily determine how many services were accessed, whether a service was accessed more than once, or which services were effective. Imagine that “Jane Doe” was seen by an outreach worker for an overdose on the street and then transferred to the hospital. Once in the hospital, her record is now maintained by the Department of Public Health (DPH), and the street outreach team does not know the outcome of the treatment provided. This means they cannot measure how many lives are saved by their overdose interventions or what they could do to improve the intervention.
Linking information is vital for understanding whether people are getting the services they need, whether they have to apply multiple times, and the “true” number of people who need City services.
At DataSF we use data science tools to help city departments link records.
The data science toolkit for record linkage
Record linkage, or data matching, is commonly used across many industries and can substantially help city governments leverage data to improve services. These techniques generally use data and statistics to identify when two records might be related and to estimate the likelihood that they in fact refer to the same person.
There are a number of data science tools available to enable record linkage, including:
String Distances: Take the names “Jane” and “Jayne”: they differ only in that “Jayne” contains a “y.” We can mathematically compare how similar or different two words are. Using the Jaro-Winkler method, for instance, “Jane” and “Jayne” are 94% similar. (There are many mathematical methods to do this!) We can then set a threshold for how similar two names must be before we treat them as the same person, though these cutoffs can be somewhat arbitrary. In the example above, this method would link names 1, 2, 3, and 4, but not 5. It is harder to apply this technique to dates, where a small change such as swapping 1/12/1990 and 12/1/1990 produces an entirely different date. (A short sketch of string and phonetic comparisons follows this list.)
Phonetics: Phonetic distances are similar to string distances, but instead of measuring differences in spelling, they measure how similar two words sound. Soundex is a common algorithm for computing phonetic codes. Phonetic algorithms may work better than string distances given the City’s diverse population and the variability in spellings of non-English names. For example, “Juan” and “John” have the same Soundex code but are only 70% similar in spelling.
Probabilistic Matching: These algorithms combine string and phonetic distances on data elements like names and dates of birth with comparisons of any other fields that may be present, such as phone number, gender, or race. These methods can handle missing data, so they can be used even when those other fields are not present in every record. By using all the available information, we can estimate the probability that two records belong to the same person: every record is compared to the others in the dataset (via string distances, phonetic distances, or exact matches) and the probability that they refer to the same person is estimated. While these methods are powerful, we still need to decide which thresholds or probabilities are acceptable for a match, and this requires a validated dataset of known matches. Using this validated “training” data, we can estimate the error rates at different thresholds. (A sketch of this idea follows this list.)
Machine Learning: These powerful algorithms learn matching patterns from a training dataset (a dataset that has already been matched). Similar to the methods above, every record is pairwise compared with others in the dataset, and a model is fit to minimize errors in predicting correct matches. Often, a large training dataset is needed to create a model that performs well across the whole dataset, which can be a significant limitation of these methods.
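To make the first two techniques concrete, here is a minimal sketch in Python. It assumes the open-source jellyfish library (the function names jaro_winkler_similarity and soundex come from recent releases and may differ slightly in older versions):

```python
import jellyfish

# String distance: Jaro-Winkler similarity runs from 0 (unrelated) to 1 (identical)
print(jellyfish.jaro_winkler_similarity("Jane", "Jayne"))  # high similarity for these two spellings
print(jellyfish.jaro_winkler_similarity("Jane", "JG"))     # much lower similarity

# Phonetics: Soundex encodes how a name sounds rather than how it is spelled
print(jellyfish.soundex("Juan"))  # "J500"
print(jellyfish.soundex("John"))  # "J500" -- same code, so the names "sound alike"

# A simple decision rule combining both signals; the 0.9 cutoff is purely illustrative
def names_match(a: str, b: str, threshold: float = 0.9) -> bool:
    similar_spelling = jellyfish.jaro_winkler_similarity(a, b) >= threshold
    similar_sound = jellyfish.soundex(a) == jellyfish.soundex(b)
    return similar_spelling or similar_sound

print(names_match("Jane", "Jayne"))  # True
```

In practice, names would be normalized (lowercased, trimmed, accents handled) before comparison, and the cutoff would be validated rather than picked by hand.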
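The probabilistic approach can be sketched in the same spirit. The snippet below is a toy version of the Fellegi-Sunter scoring idea that underlies many probabilistic linkage tools; the field weights and records are invented for illustration, and real projects estimate these parameters from validated training data and typically use libraries such as Splink or the Python recordlinkage package rather than hand-rolled code like this:

```python
import math
import jellyfish

# Illustrative parameters: m = P(fields agree | same person),
# u = P(fields agree | different people). Real values come from training data.
FIELD_PARAMS = {
    "first_name": {"m": 0.90, "u": 0.010},
    "last_name":  {"m": 0.90, "u": 0.005},
    "dob":        {"m": 0.95, "u": 0.001},
}

def agrees(field, a, b):
    """Field-level comparison: fuzzy for names, exact for date of birth."""
    if a is None or b is None:
        return None  # missing data contributes no evidence either way
    if field in ("first_name", "last_name"):
        return jellyfish.jaro_winkler_similarity(a.lower(), b.lower()) >= 0.9
    return a == b

def match_weight(rec_a, rec_b):
    """Sum of log-likelihood ratios across fields (higher = more likely a match)."""
    total = 0.0
    for field, p in FIELD_PARAMS.items():
        result = agrees(field, rec_a.get(field), rec_b.get(field))
        if result is None:
            continue
        if result:
            total += math.log(p["m"] / p["u"])
        else:
            total += math.log((1 - p["m"]) / (1 - p["u"]))
    return total

a = {"first_name": "Jane",  "last_name": "Doe", "dob": "1990-01-12"}
b = {"first_name": "Jayne", "last_name": "Doe", "dob": "1990-01-12"}
print(match_weight(a, b))  # well above zero, so likely the same person at a sensible threshold
```

Turning these weights into a calibrated probability and choosing the acceptance threshold is exactly where the validated training data mentioned above comes in.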
Using record linkage tools to improve services in San Francisco
In collaboration with other departments, DataSF has implemented probabilistic record linkage to improve service delivery in several instances within the City:
Matching COVID-19 cases to the California Vaccine Registry (DPH): When COVID-19 vaccines were first released, the City needed to know how many vaccinated individuals were still testing positive for COVID-19, how many were being hospitalized, and how these numbers compared to those for unvaccinated people. Case information is based on confirmed positive laboratory tests reported to the City, while vaccination data comes from the California Immunization Registry, and there are no common identifiers to link the two data sources. The Department of Public Health, in collaboration with DataSF, matched positive test results to vaccine records using first name, last name, date of birth, phone number, and email address, relying on a probabilistic algorithm built on the methods described above. The archived dataset is here. With this matching, we were able to understand the efficacy of COVID-19 vaccines in San Francisco and answer critical questions: How many vaccinated people tested positive across successive vaccine formulations and COVID-19 variants? How many vaccinated individuals were hospitalized, and how many died?
To estimate the COVID-19 case rates for vaccinated individuals in SF, we needed to match individuals who tested positive in San Francisco to all individuals vaccinated in 9 Bay Area counties.
Understanding applicant journeys for affordable housing: Applicants for affordable housing in San Francisco use the DAHLIA web application to view listings and submit applications. Currently, an applicant is not required to log in, and a single person can apply to multiple affordable housing listings. While City staff had information on the number of applications submitted, it was difficult to determine how many unique individuals were applying, how many times they applied, and how long it took them to get housed. In collaboration with the Mayor’s Office of Housing and Community Development, DataSF developed and implemented a probabilistic matching algorithm that groups applications by similarities in applicant information. This algorithm now reconciles new applications daily, and we know that about 115,000 unique individuals have submitted about 1 million applications since 2017. We can now count applicants by demographics and better understand who applies for which housing units.
An illustration showing reconciled and unreconciled applications from San Francisco’s DAHLIA affordable housing application web portal.
Linking data from street response outreach teams: There are more than nine street teams that offer services to unhoused individuals in San Francisco. Historically, data privacy regulations required teams to collect and store data in multiple systems, inhibiting the City’s ability to refer people to the right services, reduce duplicative services, and effectively measure program impact. Using a new data privacy framework aimed at getting more people into housing and shelter, DataSF collaborated with the Mayor’s Office of Innovation to develop a matching algorithm that links individuals across street teams. We use first names, last names, dates of birth, Social Security numbers, and medical record numbers (where available) to generate a unique identifier for unhoused individuals. With this linkage, we can measure how many encounters each individual has across teams. Matching allowed us to identify a core group of high utilizers across services and develop targeted interventions for these individuals based on their history. This has already had some success getting folks off the street.
Ethical considerations in using data linkage
Record linkage is a crucial way to help the City make better use of the data it collects. Having robust data on services at the individual level is key to better decision-making and better outcomes for residents.
However, there are some important privacy considerations with record linkage. For example: do residents want governments to know all the services they’ve used? How do we make data useful while still protecting privacy?
There are some best practices we can follow to protect resident privacy while still using data effectively. For example, after data has been linked, we can scrub personally identifiable information (PII) from the analytics datasets that analysts subsequently use to derive insights. For extremely sensitive datasets like health records, there are newer methods to conduct linkage after identifiers have been obscured with a one-way technique called “hashing”, so the raw identifiers never need to be shared.
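As a rough illustration of hashed linkage, the sketch below applies a keyed hash (HMAC-SHA256) to normalized identifiers. The key and field choices are hypothetical, and note that this approach only supports exact matches on the hashed values, not the fuzzy comparisons described earlier:

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice this lives in a securely managed key store
LINKAGE_KEY = b"replace-with-a-securely-managed-secret"

def normalize(value: str) -> str:
    # Standardize casing and whitespace so "JANE " and "jane" hash identically
    return " ".join(value.lower().split())

def hashed_identifier(first_name: str, last_name: str, dob: str) -> str:
    # Keyed hash (HMAC-SHA256) of the combined identifiers; only this digest,
    # never the raw PII, needs to leave the source system
    message = "|".join(normalize(v) for v in (first_name, last_name, dob))
    return hmac.new(LINKAGE_KEY, message.encode("utf-8"), hashlib.sha256).hexdigest()

# Two systems hashing with the same key can link records on equal digests
print(hashed_identifier("Jane", "Doe", "1990-01-12") ==
      hashed_identifier("JANE ", "doe", "1990-01-12"))  # True
```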
In general, we should also attempt to measure any biases introduced in our analyses as a consequence of linkage: is the quality of linkage the same across races and ethnicities, and are we counting people fairly? In the affordable housing project, we found that Asian and Latino applicants had lower-quality matches than other racial and ethnic groups, which carries a small risk of overcounting these groups. DataSF uses an established Ethics & Algorithm Evaluation Toolkit to help identify and assess the risks, potential impacts, and benefits of using algorithms to support service delivery. Using this toolkit or something similar is highly recommended before deploying any of these tools, especially if they will run in an automated way or shape how services are delivered.
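One way to check for this kind of bias, sketched below with an entirely hypothetical validation sample, is to compute match quality separately for each group wherever human-reviewed ground truth is available:

```python
import pandas as pd

# Hypothetical validation sample: each row is a candidate pair with the algorithm's
# decision, the human-reviewed truth, and the applicant's self-reported group
validated = pd.DataFrame({
    "predicted_match": [True, True, False, True, False, True],
    "true_match":      [True, False, False, True, True, True],
    "group":           ["A", "A", "B", "B", "B", "A"],
})

def match_quality(df: pd.DataFrame) -> pd.Series:
    # Precision: of the pairs we linked, how many were truly the same person?
    # Recall: of the true matches, how many did we actually link?
    true_pos = (df["predicted_match"] & df["true_match"]).sum()
    return pd.Series({
        "precision": true_pos / df["predicted_match"].sum(),
        "recall": true_pos / df["true_match"].sum(),
        "pairs": len(df),
    })

print(validated.groupby("group").apply(match_quality))
```

Large gaps in precision or recall between groups are a signal to revisit the comparison fields or thresholds before relying on the linked data.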
Helping others to use data science to link records
The examples from COVID-19, affordable housing, and street response above show just a few ways that the City can make even better use of the data it has to improve services while protecting privacy and reducing burdens on residents. While linking records in the City has generally focused on a specific service at a time, DataSF is working with partners to share common approaches and tools to empower even more City departments to make use of their data to improve services. If you’re a San Francisco City department interested in learning more about using these tools, you can find out more here and get in touch using this interest form.