How to use the World Inequality Database (WID)?

Accessing and utilizing data of the WID.world database.

inequality
research
WID
open-data
Author

Peter Baumgartner

Published

July 27, 2024

Modified

July 27, 2024

Logo of the World Inequality Lab (Screenshot). The WIL is a global research center studying inequality and public policies promoting social, economic, and environmental justice.

Logo of the World Inequality Lab (Screenshot). The WIL is a global research center studying inequality and public policies promoting social, economic, and environmental justice.

Research on Social Inequality

I plan to write blog articles about social inequality in Austria, my native country. To be innovative, I want to avoid repeating what other people (journalists, scientists) were saying (claiming) but to illustrate my thinking backed by original data. But where does the data come from?

Some Data Resources for Inequality Research

Resource 1 : Some Data Resources for Inequality Research

WID & A Brief History of Equality

Understanding the use of all these data resources is a complex enterprise. As a starting point, I will concentrate on the World Inequality Database (WID) because there exist several monographs where these data are used and interpreted.

As a first step, I am going to follow the arguments of one book that I will use as advanced organizer. I am referring to “A Brief History of Equality” (BHE), which is only 274 pages, is a kind of executive summary of previous books by Thomas Piketty: “Capital in the Twenty-First Century” (816 pages) and “Capital and Ideology” (1093 pages) (Piketty 2020; Piketty and Goldhammer 2017).

Lacking Graphs about Austria

Most of the time, BHE uses global, regional, or data from big, respectively ‘important’ countries like China, France, Germany, India, United Kingdom, United States, etc. Data on Austria and its interpretation are missing from the mentioned book and most scientific publications referenced on the website. An exception is the working paper “Distributional National Accounts (DINA)” for Austria, 2004-2016 by Jestl and List (2020). But this paper partly uses different indices and different concepts (for instance, the Gini coefficient).

On the WID website there are two other articles where at least Austria is shown in some graphs: You will find five figures with Austria data in “What’s New About Income Inequality in Europe (1980-2019)?” by Morgan and Neef (2020) and other two figures in “How Unequal Is Europe? Evidence from Distributional National Accounts, 1980-2017” by Blanchet, Chancel, and Gethin (2019).

So, creating figures with Austrian data is not only an excellent exercise but the result could also increase our knowledge about the mechanisms and patterns of social inequality.

Using the WID website

Key resources

To get started effectively, one has to overcome a paradoxical situation: The more detailed the methodological guidelines, the more difficult it is to find an appropriate entry point to get started.

Resource 2 : Methodological Guidelines & Other Resources

  • Distributional National Accounts (DINA) Guidelines: This is the central resource for understanding the methodology behind the decomposition of national accounts and the inequality series for income and wealth. The purpose of these Distributional National Accounts (DINA) guidelines is to present the concepts, data sources, and methods used in the World Inequality Database (WID). It is a long document with 186 pages (Blanchet et al. 2024).
  • Codes Dictionary (Code Book): It describes the general structure of the World Inequality Database (WID). It explains how information is organized within it and how to retrieve and interpret its content. This document is invaluable for getting familiarized with the construction of WID variables and the structure of the database (World Inequality Lab 2024). Converting the web page into a PDF results in a document of 18 pages.
  • Summary Table: It summarizes all available WID.world data. The search tool can look for specific countries, variables, or WID.world codes. It has currently 6253 rows and 16 columns. The summary table is an efficient starting point because it allows for a view of available data, which one can then download for more detailed analysis.
  • Computer Code: On the GitHub platform, you can inspect and fork the repository that contains all the code used to create the entire database from both the researcher’s data and official sources. It consists of 98% of STATA code, the rest (only 2%) is written in R. 1
  • Data: There are two ways to download data:
    • Direct Download: The interactive page allows one to select the required data and download it in the format that best suits one’s needs.
    • Advanced Interfaces: For special requirements, one can access the data with STATA or R interfaces. In the case of the R interface, one can download from GitHub the R {wid} package (Blanchet 2024a).

It is a lot of work to read all the necessary information before starting. Instead of using this (tedious) sequential approach, I will try an interactive procedure, iteratively changing between theory (methodological guidelines) and practice (using R to replicate the figure).

Replicating income distribution

Figure to replicate

As my test bed, I will use Figure 7 of BHE (Piketty 2022, 46).

alt-text
Figure 1: Figure 7 of BHE, taken from the original source on Piketty’s website.

Interactive Replication

As a first approach, I will replicate Figure 7 of BHE with the provided interactive tools of the WID. Here is a good starting point for income inequality.

alt-text
Figure 2: Choosing France as “Country” and using the default values for income inequality (top 10% share and bottom 50% share)

From here, I will show screenshots for every step necessary to replicate Figure 7 of BHE, but here in this post, it is Figure 1. If you click on one of the small images below, it will open the figure in the bigger original size, and you can navigate back and forth between the figures in the table below.

Figures 3-6
alt-text
Figure 3: Choosing France as “Country” and using the default values for income inequality (top 10% share and bottom 50% share).

alt-text
Figure 4: Enlarge the time span of the data from 1947 to 1820 with the bottom slider.
 

alt-text
Figure 5: Hiding the extrapolations (dotted lines) by choosing ‘More option’ on the top right corner and turning on the radio button.

alt-text
Figure 6: Check if ‘smooth lines’ is active. It should be turned on because it is the default value.
 
Figures 7-10
alt-text
Figure 7: Change the limits of the Y-axis from 0 to 60%. This is different than in the original. See my comments later in this article.

alt-text
Figure 8: Compared with the original, we still will need a third time series for the share of the middle 40% of the population.

alt-text
Figure 9: Click on the bottom of the left green navigation bar to open ‘More Indicators’ to include custom indicators in the graph.

alt-text
Figure 10: Choose ‘Income Inequality’ to get the option to include a preconfigured percentile into the graph.
Figures 11-14
alt-text
Figure 11: Scroll down to select the radio box for the ‘Middle 40%’ percentile.
 

alt-text
Figure 12: Confirm the default values. I will explain the different options in a later post.
 

alt-text
Figure 13: Our last action adds the share of the middle 40% to our graph, as was seen in Figure 8.
 

alt-text
Figure 14: Hovering with the mouse over the graphs of the interactive website, we can get tooltips with the values for every year.

Differences between original and replication

If we compare both figures side by side, we will see a significant difference. Our replication differs from the original in the years before 1900. For instance, in our figure, the people with the top 10% of income share about 60% of the total income, in contrast to the 50% share in the original graph. I need to find out where these differences come from. There may have been a correction of the computed extrapolations in an update after the book publication. In any case, this difference is not essential here. Remember, the primary purpose of the exercise is to understand the patterns of the database structure — and not (at least at the moment) to get correct figures.

Figure 15: Figure 7 of BHE, taken from the original source on Piketty’s website. The shares for the top 10% are lower. It is slightly above 50% of the total income, whereas it is about 60% in the replication. On the other hand, the share of the middle is 40% lower than that of the replication.
alt-text
Figure 16: The result of my replication is showing higher shares for the top 10% and accordingly lower shares for the middle 40% in contrast to the original figure. However, the level of the share of the bottom 50% of the population is approximately the same in both figures.

There are other differences between the two pictures, but they are more of cosmetically nature:

  • The choice of colors for the various time series is different.
  • The labels of the X- and Y-axes are different. The original is labeled every 20 years and contains a vertical line every 10 years, whereas the interactive replication has a 25-year grid. The Y-axis in the replication is missing the %-sign.
  • In the original, the time-series have different symbols in the data points (circle, triangle), which is vital in black & white (book) publications.
  • The legend is in the original at a prominent place in the middle of the graph, whereas it is in the interactive replication at the bottom.

Display the results

Until now, I have used manually taken screenshots. But there are also several possibilities to display results offered by the WID.

Download a JPG-file

alt-text
Figure 17: Income inequality, France, 1820-2022 download as JPG from the WID site and integrated into the Quarto text.

Note that there are now fewer labels on the X- and Y-axis. The legend’s appearance has changed somewhat, and some information that is irrelevant in a static view is removed.

Copy the URL

Another option is to copy the URL and integrate it as a link into the Quarto text. It opens up a web page with the figure and all the WID menus in your default browser. Others can explore the interactive graph or create their own figure from here.

See the example: Income inequality, France, 1820-2022

Embed as iframe

Another possibility is to embed the graph into an iframe.

Caution 1: Embedding WID iframes has some drawbacks

The legend overlaps the lower part of the graph. You have either to plan during the construction of the graph to prevent this drawback or (better) to adapt the code in the URL itself.

Besides, there is also an annoying cookies-consent message where the link to ‘learn more’ is wrong by pointing to another graph and not to the cookies policy.

The code in the above iframe is:

<iframe width="800" height="400" src="https://wid.world/share/#1/countrytimeseries/sptinc_p90p100_z;sptinc_p0p50_z;sptinc_p50p90_z/FR/2015/eu/k/p/yearly/s/true/9.4765/70/curve/true/1820/2022"></iframe>

It turned out that besides changing the delimiter for the axis from /9.4765/70/ to /0/62/ it is also necessary to enlarge the iframe’s height to position the legend lower. With trial & error experimenting, I got a good result after changing the height from 400 to 500. Here is the adapted iframe where you can see the full figure and also the slider to change the time period of the data:

Getting codes for indicators

By far, the most important result of our replication is that the URL contains the code for used indicators, areas, and percentiles.

https://wid.world/share/#1/countrytimeseries/sptinc_p90p100_z;sptinc_p0p50_z;sptinc_p50p90_z/FR/2015/eu/k/p/yearly/s/false/9.4765/70/curve/true/1820/2022

This is very helpful because it is now much easier to find the code explication in the Code Dictionary or the Summary Table. Let’s analyze the example of the sptinc_p90p100_z code:

  • s: The first letter s determines the code series. s means “share” and contains a fraction between 0 and 1.
  • ptinc: The next five letters are the code for the series concept. This is an essential help because a massive amount of five-letter codes exist. More than half of the 38 pages of the PDF version of the code dictionary lists five-letter codes and their meaning! Searching ptinc in the online Code Dictionary with the browser search tool results in “pretax national income”.
  • p90p100: This one is easy. p90p100 is the top 10%, the percentiles characterizing people with an income equal to or higher than 90% of the population. Or p0p50 is the bottom 50% etc.

Additionally, we have the information from the legend that the units are adults, and the income is split equally between couples or other adult members (= 20 years or older) of the household. Therefore, we can add:

  • 992: The age code description has three numbers. 992 stands for a population of individuals over age 20.
  • j: The last letter is the unit description. j means equal-split adults.

With these data, we can now look for the corresponding codes in the database and filter the database to get only the required information. The variable name is, in our case:

\[ \begin{align*} s + ptinc + 992 + j = sptinc992j \end{align*} \tag{1}\]

If we look for this complete code in the Summary Table, we can read in the column “Simple description of the variable” the following explanation:

Pre-tax national income is the sum of all pre-tax personal income flows accruing to the owners of the production factors, labor and capital, before taking into account the operation of the tax/transfer system, but after taking into account the operation of pension system. The central difference between personal factor income and pre-tax income is the treatment of pensions, which are counted on a contribution basis by factor income and on a distribution basis by pre-tax income.

Caution 2: Explaining just the procedure, not the concepts

Remember that I am not discussing the concepts behind the different parameter codes. I just want to stress how to find the code for a required indicator here. Which concepts to apply and what indicator is appropriate for a specific research question will be covered in other articles.

R Code

There is with {wid} a special R package for downloading data directly from WID (Blanchet 2024b). Read the instructions on the GitHub repo to get install information.

The R {wid} package should have a vignette with examples of its usage. But unfortunately, all the links to it are broken or not available except as .Rmd file in the GitHub repository itself. I have downloaded and rendered the vignette as a PDF for your convenience.

Plotting data with R

The central function of the {wid} package is wid::download_wid(). It has the following parameters:

Table 1: Description of the arguments for the `wid::download_wid()` function
Name Type Default Description
indicators character all Codes corresponding to indicators to retrieve
areas character all Areas (mostly countries) for which to retrieve the indicators
years numeric all Years for which to retrieve the indicators
perc character all Percentiles (part of the distribution)
ages numeric all Age groups (adults, all ages, elderly, etc.)
pop character all Population type (individual, households, tax units, etc.)
metadata logical FALSE Fetch metadata too? (eg. sources, etc.)
verbose logical FALSE Display messages showing progress?
include_extrapolation logical TRUE Include data based on extrapolations / interpolations

Download data

R Code 1 : Download data with wid::download_wid

income_france <- wid::download_wid(
    indicators = "sptinc", # Shares of pre-tax national income
    areas = "FR", # France
    perc = c("p0p50", "p50p90", "p90p100"), # Bottom 50%, middle 40% and top 10%
    ages = 992, # individuals over age 20
    pop = "j" # equal-split adults
)

save_data_file("WID", income_france, "income_france.rds")
Listing 1: Download data for the shares of France’s pre-tax national income.

(For this R code chunk, is no output available)

Display some random data

R Code 2 : Display some random data imported with wid::download_wid

data <- paste0(here::here(), "/data/WID/income_france.rds")
income_france <- base::readRDS(data)
my_glance_data(income_france, seed = 123)
   obs country   variable percentile year  value
1    1      FR sptinc992j      p0p50 1820 0.1209
2   14      FR sptinc992j      p0p50 1923 0.1375
3  118      FR sptinc992j     p50p90 1910 0.3524
4  179      FR sptinc992j     p50p90 1975 0.4765
5  195      FR sptinc992j     p50p90 1991 0.4694
6  229      FR sptinc992j    p90p100 1880 0.5932
7  244      FR sptinc992j    p90p100 1927 0.4699
8  299      FR sptinc992j    p90p100 1982 0.2994
9  306      FR sptinc992j    p90p100 1989 0.3319
10 339      FR sptinc992j    p90p100 2022 0.3479
Listing 2: Display some random data imported with wid::download_wid, including the first and last row.

Inspecting the data, we see:

  • Our dataset has 339 records.

  • All presented records have an area code FR, which stands for France.

  • All presented records have the variable name sptinc992j. This conforms to our previous calculation.

  • We see three different values for percentiles in the randomly displayed data: p0p50, p50p90, and p90p100 representing the income of the bottom 50%, the middle 40%, and the top 10% of the population.

  • Our random data show a period from 1820 to 2022.

  • The quantity for each record contains the column value.

Overall, The result looks nice, but we must check the data more thoroughly. Is there really only one type of variable? 2

Display data structure

R Code 3 : Data structure for the France data for pre-tax income (1820-2022)

skimr::skim(income_france)
Data summary
Name income_france
Number of rows 339
Number of columns 5
_______________________
Column type frequency:
character 3
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 2 2 0 1 0
variable 0 1 10 10 0 1 0
percentile 0 1 5 7 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1964.23 37.08 1820.00 1938.00 1966.00 1994.00 2022.00 ▁▁▆▇▇
value 0 1 0.33 0.12 0.12 0.21 0.34 0.45 0.61 ▇▂▇▇▁
Listing 3: Using the {skimr} package to display the data structure for the France pre-tax income data 1820-2022

Here, we see that no data are missing, that we have only one value for the variables country and variable, and exactly the required three values for percentile.

We also see from the small histogram for year that we do not have many data points for more of the first half of the time.

Display the national income figure

R Code 4 : Pre-tax national income inequality in France, 1820-2022

ggplot2::ggplot(income_france, 
    ggplot2::aes(x = year, y = value, color = percentile)) +
    ggplot2::geom_line() + 
    ggplot2::geom_point() +
    # ggplot2::geom_smooth(span = .9) + # for smoothing
    ggplot2::ylab("share of income") +
    ggplot2::scale_color_discrete(
        labels = c("p0p50" = "bottom 50%", "p90p100" = "top 10%", "p50p90" = "middle 40%")
    ) + 
    ggplot2::ggtitle("Pre-tax national income inequality in France, 1820-2022") +
    ggplot2::theme(legend.position = "bottom")

## save plot just once
# ggplot2::ggsave(filename = "BHE-sptinc992j_fr.png")
Listing 4: Pre-tax national income inequality in France, 1820-2022

Comparison of my result with the original Figure 7

Figure 18: Figure 7 of BHE, taken from the original source on Piketty’s website.
alt-text
Figure 19: Resulting figure after data download and plotting with R

We are pretty near if we compare the above plots. Besides the mentioned difference in the magnitude of the top 10% and middle 40% before 1900, there are the following differences:

  • title
  • grid for x- and y-axis
  • symbols for the data points
  • position of the legend
  • text of legend
  • the smoothness of the time series

However, these differences are all solvable with the appropriate R code. I leave it here because the purpose of this article was to work with WID (and not to replicate figure 7 of BHE exactly.)

Conclusion

I have shown procedures for working with the inequality data of WID.

  • One approach explores the interactive website on wid.world and tries to create the desired graph. There are several ways to use the resulting figure:
    • Downloading the picture as JPG
    • Embedding a link to the interactive graph on your website
    • Embedding the result as iframe in your website
    • Exploring the code in the URL and using it for the second approach (downloading the required data with R)
  • The second approach uses the {wid} R package, especially the function wid::download_wid(). Knowing the required codes for the download function is essential. The first approach could help find the correct code composition in the WID code book, the Codes Dictionary or in the Summary Table.
Important 1: How-to procedures are not enough

The post only describes “how-to” procedures for WID and does not discuss the available concepts appropriate for specific research questions. I plan to describe and reflect on concepts for inequality research in later blog posts.

Appendix

Glossary
term definition
Advanced Organizer Advance organizers are pedagogic devices that bridge the gap between what learners already know and what learners need to know. (<a href="https://link.springer.com/referenceworkentry/10.1007/978-1-4419-1428-6_157">Encyclopedia of the Science of Learning</a>)
BHE Piketty, T. (2022). A Brief History of Equality (S. Rendall, Trans.). Harvard University Press.
DINA Distributional national accounts (DINAs) aim to gather information on the distribution of the net national income and to explore it over time. ([Distributional National Accounts (DINA) for Austria, 2004-2016](https://wid.world/document/distributional-national-accounts-dina-for-austria-2004-2016-world-inequality-lab-wp-2020-23/))
iframe iframes allow you to embed HTML documents inside another HTML document. They offer a seamless way to integrate content from one source into another, enabling developers to create more dynamic and interactive web pages. ([dev.to](https://dev.to/joanayebola/what-is-iframe-and-how-to-use-them-1c63))
JPG JPEG (/ˈdʒeɪpɛɡ/ JAY-peg, short for Joint Photographic Experts Group is a commonly used method of lossy compression for digital images. [(Wikipedia](https://en.wikipedia.org/wiki/JPEG))
National Accounts National accounts are a system of accounts and balance sheets that provide a broad and integrated framework to describe an economy, whether a region, a country, or a group of countries such as the European Union (EU). For internationally comparable national accounts this system needs to be based on common concepts, definitions, classifications and accounting rules, in order to arrive at a consistent, reliable and comparable quantitative description of an economy. National accounts provide systematic and detailed economic data useful for economic analysis to support the development and monitoring of policy-making. [eurostat - Statistics explained]
Percentile The set of divisions that produce exactly 100 equal parts in a series of continuous values, such as blood pressure, weight, height, etc. Thus a person with blood pressure above the 80th percentile has a greater blood pressure value than over 80% of the other recorded values.” (CDS, p.323)
WID The 'World Inequality Database' (WID) aims to provide open and convenient access to the most extensive available database on the historical evolution of the world distribution of income and wealth, both within countries and between countries. [WID.WORLD](https://wid.world/wid-world/)
Back to top

References

Author Collective ’World Inequality Lab’. 2024. “WID - World Inequality Database.” https://wid.world/.
Blanchet, Thomas. 2024a. “Wid: Download Data from WID.world.” https://github.com/WIDworld/wid-r-tool.
———. 2024b. “Wid: Download Data from WID.world.” https://github.com/WIDworld/wid-r-tool.
Blanchet, Thomas, Lucas Chancel, Ignacio Flores, and Marc Morgan. 2024. “Distributional National Accounts Guidelines: Methods and Concepts Used in the World Inequality Database (Version 2024-02-27),” February. https://wid.world/document/distributional-national-accounts-guidelines-2020-concepts-and-methods-used-in-the-world-inequality-database/.
Blanchet, Thomas, Lucas Chancel, and Amory Gethin. 2019. “How Unequal Is Europe? Evidence from Distributional National Accounts, 19802017.” WID.world, Working Paper, no. 2019/06 (April): 77. https://wid.world/document/bcg2019-full-paper/.
Global Change Data Lab, and Max Roser. 2024. “OWID Homepage.” Our World in Data, March. https://ourworldindata.org.
Jestl, Stefan, and Emanuel List. 2020. “Distributional National Accounts (DINA) for Austria, 2004-2016,” October. https://wid.world/document/distributional-national-accounts-dina-for-austria-2004-2016-world-inequality-lab-wp-2020-23/.
LIS Author Collective. 2024. “LIS Cross-National Data Center in Luxembourg.” https://www.lisdatacenter.org/.
Morgan, Marc, and Theresa Neef. 2020. “Whats New About Income Inequality in Europe (1980-2019)?” World Inequality Lab, Issue Brief, no. 2020/04 (November): 16. https://wid.world/document/whats-new-about-income-inequality-in-europe-1980-2019/.
Piketty, Thomas. 2020. Capital and Ideology. Illustrated Edition. Cambridge, Massachusetts ; London, England: Harvard University Press.
———. 2022. A Brief History of Equality. Cambridge, Massachusetts: Harvard University Press.
Piketty, Thomas, and Arthur Goldhammer. 2017. Capital in the Twenty-First Century. Harvard University Press.
World Bank. 2024. “Poverty and Inequality Platform (Version: 20240326_2017).” https://pip.worldbank.org/home.
World Inequality Lab. 2024. “Codes Dictionary.” https://wid.world/codes-dictionary/.

Footnotes

  1. This is somewhat disappointing. As I do not know STATA, these code files do not make much sense to me. However, a quick internet search showed that there are some resources to learn how to translate STATA directives into R commands. Two examples: Making the Transition? A Guide for Switching from Stata to R and STATA to R cheat sheet.↩︎

  2. How important this check is showed me a previous version in which I had not included ages = 992 because I thought the adult population was already defined with pop = "j", the letter code for equal-split adults. But it turned out that without the explicit definition for the age, I also downloaded the default value 999 (all ages).↩︎