Using the WID website
Key resources
To get started effectively, one has to overcome a paradoxical situation: The more detailed the methodological guidelines, the more difficult it is to find an appropriate entry point to get started.
Resource 2 : Methodological Guidelines & Other Resources
- Distributional National Accounts (DINA) Guidelines: This is the central resource for understanding the methodology behind the decomposition of national accounts and the inequality series for income and wealth. The purpose of these Distributional National Accounts (DINA) guidelines is to present the concepts, data sources, and methods used in the World Inequality Database (WID). It is a long document with 186 pages (Blanchet et al. 2024).
- Codes Dictionary (Code Book): It describes the general structure of the World Inequality Database (WID). It explains how information is organized within it and how to retrieve and interpret its content. This document is invaluable for getting familiarized with the construction of WID variables and the structure of the database (World Inequality Lab 2024). Converting the web page into a PDF results in a document of 18 pages.
- Summary Table: It summarizes all available WID.world data. The search tool can look for specific countries, variables, or WID.world codes. It currently has 6253 rows and 16 columns. The summary table is an efficient starting point because it gives an overview of the available data, which one can then download for more detailed analysis.
- Computer Code: On the GitHub platform, you can inspect and fork the repository that contains all the code used to create the entire database from both the researchers’ data and official sources. About 98% of it is STATA code; the rest (only 2%) is written in R. 1
- Data: There are two ways to download data:
  - Direct Download: The interactive page allows one to select the required data and download it in the format that best suits one’s needs.
  - Advanced Interfaces: For special requirements, one can access the data with STATA or R interfaces. In the case of the R interface, one can download the R {wid} package from GitHub (Blanchet 2024a).
It is a lot of work to read all the necessary information before starting. Instead of using this (tedious) sequential approach, I will try an interactive procedure, iteratively changing between theory (methodological guidelines) and practice (using R to replicate the figure).
Replicating income distribution
Figure to replicate
As my test bed, I will use Figure 7 of BHE (Piketty 2022, 46).
Interactive Replication
As a first approach, I will replicate Figure 7 of BHE with the provided interactive tools of the WID. Here is a good starting point for income inequality.
From here, I will show screenshots of every step necessary to replicate Figure 7 of BHE (Figure 1 in this post). Clicking one of the small images below opens the figure at its original size, and you can navigate back and forth between the figures in the table below.
Figures 3-6
Figures 7-10
Figures 11-14
Differences between original and replication
If we compare both figures side by side, we see a significant difference in the years before 1900. For instance, in our figure, the top 10% of earners receive about 60% of total income, in contrast to the 50% share in the original graph. I need to find out where these differences come from. There may have been a correction of the computed extrapolations in an update after the book’s publication. In any case, this difference is not essential here. Remember, the primary purpose of the exercise is to understand the patterns of the database structure, not (at least at the moment) to get correct figures.
There are other differences between the two pictures, but they are more cosmetic in nature:
- The choice of colors for the various time series is different.
- The labels of the X- and Y-axes are different. The original is labeled every 20 years and contains a vertical line every 10 years, whereas the interactive replication has a 25-year grid. The Y-axis in the replication is missing the %-sign.
- In the original, the time-series have different symbols in the data points (circle, triangle), which is vital in black & white (book) publications.
- In the original, the legend sits at a prominent place in the middle of the graph, whereas in the interactive replication it is at the bottom.
Display the results
Until now, I have used manually taken screenshots. But there are also several possibilities to display results offered by the WID.
Download a JPG-file
Note that there are now fewer labels on the X- and Y-axis. The legend’s appearance has changed somewhat, and some information that is irrelevant in a static view is removed.
Copy the URL
Another option is to copy the URL and integrate it as a link into the Quarto text. It opens up a web page with the figure and all the WID menus in your default browser. Others can explore the interactive graph or create their own figure from here.
See the example: Income inequality, France, 1820-2022
Embed as iframe
Another possibility is to embed the graph into an iframe.
The legend overlaps the lower part of the graph. To prevent this, you either have to plan for it while constructing the graph or (better) adapt the code in the URL itself.
Besides, there is also an annoying cookies-consent message where the link to ‘learn more’ is wrong by pointing to another graph and not to the cookies policy.
The code in the above iframe is:
<iframe width="800" height="400" src="https://wid.world/share/#1/countrytimeseries/sptinc_p90p100_z;sptinc_p0p50_z;sptinc_p50p90_z/FR/2015/eu/k/p/yearly/s/true/9.4765/70/curve/true/1820/2022"></iframe>
It turned out that besides changing the axis limits in the URL from `/9.4765/70/` to `/0/62/`, it is also necessary to enlarge the iframe’s height to position the legend lower. Through trial and error, I got a good result after changing the height from `400` to `500`. The adapted iframe shows the full figure and also the slider to change the time period of the data.
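The two changes described above can also be scripted. The following is a minimal base R sketch that rewrites the embed code shown earlier; the variable names are my own, and the replacement values are simply the ones found by trial and error.

```r
# Embed code as shown above, split over several lines for readability
iframe_original <- paste0(
  '<iframe width="800" height="400" src="https://wid.world/share/#1/',
  'countrytimeseries/sptinc_p90p100_z;sptinc_p0p50_z;sptinc_p50p90_z/',
  'FR/2015/eu/k/p/yearly/s/true/9.4765/70/curve/true/1820/2022"></iframe>'
)

# 1. Replace the axis limits in the URL
#    (fixed = TRUE matches the text literally instead of as a regular expression)
iframe_adapted <- sub("/9.4765/70/", "/0/62/", iframe_original, fixed = TRUE)

# 2. Enlarge the iframe height so that the legend moves below the graph
iframe_adapted <- sub('height="400"', 'height="500"', iframe_adapted, fixed = TRUE)

cat(iframe_adapted, "\n")
```

The printed string can then be pasted into the document as raw HTML.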
Getting codes for indicators
By far, the most important result of our replication is that the URL contains the code for used indicators, areas, and percentiles.
https://wid.world/share/#1/countrytimeseries/sptinc_p90p100_z;sptinc_p0p50_z;sptinc_p50p90_z/FR/2015/eu/k/p/yearly/s/false/9.4765/70/curve/true/1820/2022
This is very helpful because it is now much easier to look up the meaning of each code in the Codes Dictionary or the Summary Table. Let’s analyze the example of the `sptinc_p90p100_z` code:
- s: The first letter `s` determines the code series. `s` means “share” and contains a fraction between 0 and 1.
- ptinc: The next five letters are the code for the series concept. This is an essential help because a massive number of five-letter codes exist. More than half of the 38 pages of the PDF version of the code dictionary list five-letter codes and their meaning! Searching for `ptinc` in the online Codes Dictionary with the browser search tool results in “pretax national income”.
- p90p100: This one is easy. `p90p100` is the top 10%, that is, the people whose income is equal to or higher than that of 90% of the population. Likewise, `p0p50` is the bottom 50%, etc.
Additionally, we have the information from the legend that the units are adults, and the income is split equally between couples or other adult members (= 20 years or older) of the household. Therefore, we can add:
- 992: The age code consists of three digits. `992` stands for a population of individuals over age 20.
- j: The last letter is the unit description. `j` means equal-split adults.
With these data, we can now look for the corresponding codes in the database and filter the database to get only the required information. The variable name is, in our case:
\[ \texttt{s} + \texttt{ptinc} + \texttt{992} + \texttt{j} = \texttt{sptinc992j} \tag{1} \]
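The same composition can be sketched in a few lines of base R. The variable names below are my own; the age and population codes are typed in by hand because they come from the legend and not from the URL, and the trailing `z` of the URL code is not covered here.

```r
# Indicator string as it appears in the WID share URL
url_code <- "sptinc_p90p100_z"

# Split at the underscores: variable stem, percentile, trailing part
parts      <- strsplit(url_code, "_", fixed = TRUE)[[1]]
stem       <- parts[1]              # "sptinc"
percentile <- parts[2]              # "p90p100"

# Decompose the six-letter stem
series_type <- substr(stem, 1, 1)   # "s"     = share
concept     <- substr(stem, 2, 6)   # "ptinc" = pre-tax national income

# Codes taken from the legend of the interactive graph
age_code <- "992"                   # individuals over age 20
pop_code <- "j"                     # equal-split adults

# Paste the four parts together to get the full variable name
variable <- paste0(series_type, concept, age_code, pop_code)
variable
```

The result is the string `sptinc992j`, the complete code to search for.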
If we look for this complete code in the Summary Table, we can read in the column “Simple description of the variable” the following explanation:
Pre-tax national income is the sum of all pre-tax personal income flows accruing to the owners of the production factors, labor and capital, before taking into account the operation of the tax/transfer system, but after taking into account the operation of pension system. The central difference between personal factor income and pre-tax income is the treatment of pensions, which are counted on a contribution basis by factor income and on a distribution basis by pre-tax income.
Remember that I am not discussing the concepts behind the different parameter codes. I just want to stress how to find the code for a required indicator here. Which concepts to apply and what indicator is appropriate for a specific research question will be covered in other articles.
R Code
With {wid}, there is a dedicated R package for downloading data directly from WID (Blanchet 2024b). See the instructions in the GitHub repo for how to install it.
The R {wid} package should have a vignette with examples of its usage. Unfortunately, all the links to it are broken, and the vignette is only available as an `.Rmd` file in the GitHub repository itself. I have downloaded and rendered the vignette as a PDF for your convenience.
Plotting data with R
The central function of the {wid} package is `wid::download_wid()`. It has the following parameters:
Name | Type | Default | Description |
---|---|---|---|
indicators | character | all | Codes corresponding to indicators to retrieve |
areas | character | all | Areas (mostly countries) for which to retrieve the indicators |
years | numeric | all | Years for which to retrieve the indicators |
perc | character | all | Percentiles (part of the distribution) |
ages | numeric | all | Age groups (adults, all ages, elderly, etc.) |
pop | character | all | Population type (individual, households, tax units, etc.) |
metadata | logical | FALSE | Fetch metadata too? (eg. sources, etc.) |
verbose | logical | FALSE | Display messages showing progress? |
include_extrapolation | logical | TRUE | Include data based on extrapolations / interpolations |
Download data
R Code 1 : Download data with wid::download_wid
(No output is available for this R code chunk.)
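The download call itself is not reproduced here, so the following is only a minimal sketch of what it could look like. It assumes that the {wid} package is installed from GitHub, that `"sptinc"` is accepted as the indicator stem, and that the argument names match the parameter table above. The wrapper name `download_income_france()` is hypothetical; wrapping the call in a function means nothing is downloaded until it is called.

```r
# Sketch of the download step; needs the {wid} package and internet access.
# download_income_france() is a hypothetical wrapper name, not part of {wid}.
download_income_france <- function() {
  wid::download_wid(
    indicators = "sptinc",                         # share of pre-tax national income
    areas      = "FR",                             # France
    perc       = c("p0p50", "p50p90", "p90p100"),  # bottom 50%, middle 40%, top 10%
    ages       = 992,                              # individuals over age 20
    pop        = "j"                               # equal-split adults
  )
}

# Run the download (commented out so the sketch has no side effects):
# income_france <- download_income_france()
```

Setting both `ages` and `pop` explicitly matters: leaving one of them at its default of "all" returns additional variables.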
Display some random data
R Code 2 : Display some random data imported with wid::download_wid
Inspecting the data, we see:
- Our dataset has 339 records.
- All presented records have the area code `FR`, which stands for France.
- All presented records have the variable name `sptinc992j`. This conforms to our previous calculation.
- We see three different values for percentiles in the randomly displayed data: `p0p50`, `p50p90`, and `p90p100`, representing the income of the bottom 50%, the middle 40%, and the top 10% of the population.
- Our random data show a period from 1820 to 2022.
- The column `value` contains the quantity for each record.
Overall, the result looks nice, but we must check the data more thoroughly. Is there really only one type of variable? 2
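One quick way to answer this question is to list the distinct values of the identifying columns. This is a minimal sketch, assuming the column names `country`, `variable`, and `percentile`; the helper name and the three-row stand-in data frame are purely illustrative, and the stand-in contains placeholders, not WID data.

```r
# Hypothetical helper: distinct values of the identifying columns
distinct_ids <- function(df) {
  lapply(df[c("country", "variable", "percentile")], unique)
}

# Stand-in with placeholder content (NOT real WID data),
# only there to make the sketch runnable on its own
toy <- data.frame(
  country    = "FR",
  variable   = "sptinc992j",
  percentile = c("p0p50", "p50p90", "p90p100"),
  year       = 2000,
  value      = NA_real_,
  stringsAsFactors = FALSE
)

distinct_ids(toy)

# With the real download:
# distinct_ids(income_france)
```

If `variable` lists more than one code, the download contains more than one type of variable and needs a tighter filter.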
Display data structure
R Code 3 : Data structure for the France data for pre-tax income (1820-2022)
skimr::skim(income_france)
Name | income_france |
Number of rows | 339 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
country | 0 | 1 | 2 | 2 | 0 | 1 | 0 |
variable | 0 | 1 | 10 | 10 | 0 | 1 | 0 |
percentile | 0 | 1 | 5 | 7 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 1964.23 | 37.08 | 1820.00 | 1938.00 | 1966.00 | 1994.00 | 2022.00 | ▁▁▆▇▇ |
value | 0 | 1 | 0.33 | 0.12 | 0.12 | 0.21 | 0.34 | 0.45 | 0.61 | ▇▂▇▇▁ |
Here, we see that no data are missing, that we have only one value for the variables `country` and `variable`, and exactly the required three values for `percentile`.
We also see from the small histogram for `year` that data points are sparse for more than the first half of the period.
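The histogram gives only a rough impression. A small table of how many distinct years fall into each period makes the coverage explicit. The helper below is a hypothetical sketch in base R; it only assumes that the data frame has a numeric `year` column.

```r
# Hypothetical helper: number of distinct years with data per period
years_per_period <- function(df, width = 50) {
  yrs <- sort(unique(df$year))
  # Integer division maps each year to the start of its period, e.g. 1820 -> 1800
  period_start <- (yrs %/% width) * width
  table(period_start)
}

# With the real download:
# years_per_period(income_france)
```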
Display the national income figure
R Code 4 : Pre-tax national income inequality in France, 1820-2022
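The plotting code is not reproduced here. As a minimal sketch of how such a figure could be drawn, the following helper uses only base R graphics and assumes the columns `percentile`, `year`, and `value`. The helper name, labels, colors, and legend position are my own choices, not those of the original figure.

```r
# Hypothetical helper: one line per percentile group, shares shown in percent
plot_income_shares <- function(df) {
  groups  <- c("p0p50", "p50p90", "p90p100")
  labels  <- c("Bottom 50%", "Middle 40%", "Top 10%")
  symbols <- c(16, 17, 15)  # filled circle, triangle, square

  # Empty plot frame spanning all years and shares from 0 to the maximum
  plot(range(df$year), c(0, 100 * max(df$value, na.rm = TRUE)),
       type = "n", xlab = "Year", ylab = "Share of total income (%)",
       main = "Pre-tax national income inequality in France")

  # Add one series per group, sorted by year, with its own symbol and color
  for (i in seq_along(groups)) {
    d <- df[df$percentile == groups[i], ]
    d <- d[order(d$year), ]
    lines(d$year, 100 * d$value, type = "o", pch = symbols[i], col = i)
  }

  legend("top", legend = labels, pch = symbols,
         col = seq_along(groups), lty = 1, horiz = TRUE)
  invisible(groups)
}

# With the real download:
# plot_income_shares(income_france)
```

Different point symbols per series keep the lines distinguishable in black and white, and multiplying the shares by 100 gives a percent axis.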
Comparison of my result with the original Figure 7
The two plots are pretty close. Besides the already mentioned difference in the magnitude of the top 10% and middle 40% before 1900, there are the following differences:
- title
- grid for x- and y-axis
- symbols for the data points
- position of the legend
- text of legend
- the smoothness of the time series
However, these differences are all solvable with the appropriate R code. I leave it here because the purpose of this article was to work with WID (and not to replicate Figure 7 of BHE exactly).
Conclusion
I have shown procedures for working with the inequality data of WID.
- The first approach explores the interactive website on wid.world and tries to create the desired graph. There are several ways to use the resulting figure: download it as a JPG file, copy its URL, or embed it as an iframe.
- The second approach uses the {wid} R package, especially the function `wid::download_wid()`. Knowing the required codes for the download function is essential. The first approach helps to find the correct code composition, which can then be looked up in the Codes Dictionary (code book) or in the Summary Table.
The post only describes “how-to” procedures for WID and does not discuss the available concepts appropriate for specific research questions. I plan to describe and reflect on concepts for inequality research in later blog posts.
Appendix
Glossary
term | definition |
---|---|
Advanced Organizer | Advance organizers are pedagogic devices that bridge the gap between what learners already know and what learners need to know. ([Encyclopedia of the Science of Learning](https://link.springer.com/referenceworkentry/10.1007/978-1-4419-1428-6_157)) |
BHE | Piketty, T. (2022). A Brief History of Equality (S. Rendall, Trans.). Harvard University Press. |
DINA | Distributional national accounts (DINAs) aim to gather information on the distribution of the net national income and to explore it over time. ([Distributional National Accounts (DINA) for Austria, 2004-2016](https://wid.world/document/distributional-national-accounts-dina-for-austria-2004-2016-world-inequality-lab-wp-2020-23/)) |
iframe | iframes allow you to embed HTML documents inside another HTML document. They offer a seamless way to integrate content from one source into another, enabling developers to create more dynamic and interactive web pages. ([dev.to](https://dev.to/joanayebola/what-is-iframe-and-how-to-use-them-1c63)) |
JPG | JPEG (/ˈdʒeɪpɛɡ/ JAY-peg, short for Joint Photographic Experts Group) is a commonly used method of lossy compression for digital images. ([Wikipedia](https://en.wikipedia.org/wiki/JPEG)) |
National Accounts | National accounts are a system of accounts and balance sheets that provide a broad and integrated framework to describe an economy, whether a region, a country, or a group of countries such as the European Union (EU). For internationally comparable national accounts this system needs to be based on common concepts, definitions, classifications and accounting rules, in order to arrive at a consistent, reliable and comparable quantitative description of an economy. National accounts provide systematic and detailed economic data useful for economic analysis to support the development and monitoring of policy-making. [eurostat - Statistics explained] |
Percentile | “The set of divisions that produce exactly 100 equal parts in a series of continuous values, such as blood pressure, weight, height, etc. Thus a person with blood pressure above the 80th percentile has a greater blood pressure value than over 80% of the other recorded values.” (CDS, p.323) |
WID | The 'World Inequality Database' (WID) aims to provide open and convenient access to the most extensive available database on the historical evolution of the world distribution of income and wealth, both within countries and between countries. [WID.WORLD](https://wid.world/wid-world/) |
References
Footnotes
This is somewhat disappointing. As I do not know STATA, these code files do not make much sense to me. However, a quick internet search showed that there are some resources to learn how to translate STATA directives into R commands. Two examples: Making the Transition? A Guide for Switching from Stata to R and STATA to R cheat sheet.↩︎
A previous version showed me how important this check is: I had not included `ages = 992` because I thought the adult population was already defined with `pop = "j"`, the letter code for equal-split adults. But it turned out that without the explicit definition for the age, I also downloaded the default value `999` (all ages).↩︎