First Things First: Days and Nights

This blog post focuses on the problem of days and nights in the everyday itinerary that Kate and Harmony are putting together, first from 1950-53 and now from 1947-60 (see their mini essay in Current Research in Digital History). When I joined Dunham’s Data last fall, I started by looking for patterns and other meaningful information in this dataset in order to model stay lengths, that is, the number of consecutive nights that Dunham spent in the same place. First, we need to understand what days and nights look like from a data perspective.

Our dataset captures where Dunham started and ended every single day with two fields: “city1” and “city2,” respectively. The field city1 is mandatory. The field city2 is either empty or different from city1. Most days belong to the former case, meaning that Dunham stayed in city1 for the whole day. When city1 and city2 are different, this is interpreted as Dunham waking up in city1, travelling from city1 to city2 during that day, and spending the night in city2. The trip from city1 to city2 may include stops in one or more other cities (“transit cities”).

If we understand our dataset as a table with one row per day, we can define the number of nights (N_NIGHTS) in a city as the count of consecutive rows with the same value in the column city1. The number of days (N_DAYS) counts those same consecutive rows plus, optionally, the row immediately before them if its city2 has that same value.

In the example provided in Table 1, we can observe that Dunham stayed in city B for two nights (days # 2 and 3) and three days (days # 1, 2, and 3). Sometimes, Dunham and her company travel overnight and arrive in a new city in the morning. Therefore, we can’t assume they always arrive in a city the night before when this is not made explicit in the data. This is the case of city C, where we can tell they stayed for four days (days # 4, 5, 6, and 7) and three nights (days # 4, 5, and 6), but unlike B on day #1, we don’t know when they arrived in C.

From the above, we formalized that stay length = N_NIGHTS and that N_DAYS = N_NIGHTS + 1. Transit cities are defined in opposition to stays: they’re cities in which the company stops for hours but doesn’t stay overnight. A transit city can be made explicit in the column transit, or it can be a city1 whose stay length is equal to 0 nights. In Table 1, the company has stays in B, C, and E; D and F are transit cities; and we don’t have enough information about A.
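The counting rules above can be sketched in a few lines of Python. This is only one possible reading, not the project's actual code: it counts N_DAYS first (consecutive rows with the same city1, plus the previous row when its city2 names that city) and then derives N_NIGHTS from the relation N_DAYS = N_NIGHTS + 1. The function name `stays` and the sample rows are hypothetical; the rows are made up to mirror the situations described for cities B and C and are not the project's data.

```python
from itertools import groupby

# Hypothetical rows (day, city1, city2); an empty city2 means no travel that day.
# These rows are invented for illustration and are NOT the project's Table 1.
rows = [
    (1, "A", "B"),   # wakes up in A, travels, spends the night in B
    (2, "B", ""),
    (3, "B", ""),
    (4, "C", ""),    # arrival in C is not explicit (e.g. overnight travel)
    (5, "C", ""),
    (6, "C", ""),
    (7, "C", "E"),
    (8, "E", ""),
]

def stays(rows):
    """Return (city, n_days, n_nights) for each run of consecutive rows sharing city1.

    Assumption: N_NIGHTS is derived as N_DAYS - 1, following the relation
    N_DAYS = N_NIGHTS + 1 stated in the text.
    """
    result = []
    prev_city2 = None  # city2 of the last row of the previous run
    for city, group in groupby(rows, key=lambda r: r[1]):
        group = list(group)
        n_days = len(group)
        if prev_city2 == city:
            # The arrival day is explicit in the previous row, so it counts as a day.
            n_days += 1
        result.append((city, n_days, n_days - 1))
        prev_city2 = group[-1][2]
    return result

for city, n_days, n_nights in stays(rows):
    print(city, n_days, n_nights)
```

Under this reading, a city with 0 nights would be treated as a transit city. The first and last runs in any table are cut off by the table's edges, so their counts are lower bounds rather than real stay lengths, which is why nothing can be concluded about A.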

Stays, stay lengths, and transit cities are key to tracing Dunham’s everyday itinerary with absolute precision. This is particularly important for a project that taught me that every single piece of data matters. Dunham’s Data isn’t like one of those big data projects where large amounts of data are collected automatically by bulk downloads or by querying databases via APIs. In such projects, a certain level of error in the data or in the calculations is accepted, and so is some data loss if, for example, dirty records can’t be cleaned correctly or if manual correction isn’t feasible due to the size of the data. This is not the case for Dunham’s Data. From my training in statistics, I would argue that we can place a very high degree of confidence in our results because so much effort has been put into the manual creation and curation of the dataset.

After we figured out how to properly count days and nights and defined what constitutes a stay based on them, I could start the analysis. For example, I wanted to look at the most exceptional cities, that is, the most visited ones, to shed some light on Dunham’s preferred destinations. For that, I created two visualizations: a bar chart (Figure 1) and a swarm plot (Figure 2). Figures 1 and 2 plot complementary aspects of the most visited cities. Both figures chart the number of stays per city, but while the bar chart limits the information displayed to aggregated data such as the mean stay length or the sum of all stay lengths, the swarm plot shows individual data for each stay. As an illustration, Figure 1 reveals that Paris was the most visited city, both by number of visits (7) and by the total number of nights she spent there across those seven visits (180). Figure 2 shows, instead, each of the seven stays in Paris, and reveals that three of them were notably long (29, 38, and 89 nights).

Figure 1. Top 15 most visited cities: number of stays, mean stay length, and total stay length (logarithmic scale).

Figure 2. Top 8 most visited cities: number of stays and individual stay lengths.

Creating visualizations like these is an iterative process: I extract results from the data, the whole team discusses what’s meaningful for the project, what’s not, and what’s missing, and then we decide where to go in the next iteration. For instance, Figure 2 was created after we merged cities closer than 15 km to each other (a value chosen empirically), such as Los Angeles, Hollywood, and Beverly Hills, and also Port-au-Prince and Petion-Ville. The swarm plot is a useful instrument for seeing at a glance the number of visits and how many nights Dunham lingered in a city, but it is not informative about the dates on which these stays happened. That is, the image doesn’t say which stays happened first, whether the first stay is longer or shorter than the subsequent returns, or whether the difference in lengths is even relevant. For instance, we know that the stays in Paris are evenly distributed over time, that New York City had a first long stay in 1950 and no relevant returns, and that Los Angeles became important from April to November 1953, but none of this is displayed in Figure 2. To overcome this limitation, I began to build other kinds of visualizations, such as timelines, which will be introduced in upcoming posts.
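As a side note on the merging step mentioned above, here is a minimal sketch of how cities closer than 15 km could be grouped. It is an illustration under assumptions, not the method the project used: it measures great-circle (haversine) distance and merges transitively, so two cities end up in the same group if a chain of neighbours each under the threshold connects them. The function names and the coordinates (approximate city-centre values) are assumptions added for the example.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km: mean Earth radius

def merge_nearby(coords, threshold_km=15.0):
    """Group city names whose pairwise distance is below the threshold (transitively)."""
    parent = {name: name for name in coords}

    def find(x):
        # Follow parent links to the group's representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = list(coords)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if haversine_km(coords[a], coords[b]) < threshold_km:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())

# Approximate coordinates, assumed for illustration only.
coords = {
    "Los Angeles": (34.0522, -118.2437),
    "Hollywood": (34.0928, -118.3287),
    "Beverly Hills": (34.0736, -118.4004),
    "Port-au-Prince": (18.5944, -72.3074),
    "Petion-Ville": (18.5125, -72.2853),
    "Paris": (48.8566, 2.3522),
}
groups = merge_nearby(coords)
print(groups)
```

Transitive merging is a design choice: it keeps neighbouring places together even when the two outermost ones sit right at the threshold, but with a larger threshold it could also chain together places that are far apart.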