as attached – STATS ONLY Bonus HW – The Power of Statistics

All we want is power

– Statisticians

https://en.wikipedia.org/wiki/Power_(statistics)

This assignment is hard.

It is not required, and I expect most of you to skip it. It is more useful for learning new skills than for earning bonus points. You can earn up to 10 points to your Total grade (out of 100), that is equivalent to two HWs, 50% of a Midterm or about 33% of Final. But these points are relatively harder to earn.

Housing is a special asset. It is owned by the majority of population – unlike stocks that are held mostly by the rich. At the same time, different people buy different houses – the dynamics of the sub-markets depend on the characteristics of these people (e.g. owners of small houses default more frequently because they are poor). Studying the distribution of households’ finances (in addition to just total numbers) is useful for understanding how the housing and mortgage markets work and their dynamics, especially the boom-bust cycles.

The goal of this assignment is making just the first step – learning how to work with detailed micro-level data on the US households. The main data source about the balance sheets at the household level is the Survey of Consumer Finances (SCF). We are going to compute a few numbers and produce a few plots with more emphasis on learning how to compute and to plot rather than going deep into interpretations and trying to understand the whole universe of the data. It may be not as exciting as the alternative (warning: boring stuff!), but it will help you obtain the skills and have a lot more fun later (if you survive this HW!).

Part 0 {0 points} Getting the Data

1. Download the Summary Extract Public Data:

https://www.federalreserve.gov/econres/scfindex.htm

This is a smaller version of the dataset – the only one available in Excel. It’s still pretty big and ugly. Enjoy.

2. Read the description of the SCF data. You may watch the short video to see what’s possible, but it’s not very helpful for this hw.

The Federal Reserve provides documentation that explains the data, but it is not very user-friendly. The only way to fully understand the data is to read the “Variable Definitions” and “Codebook” files. Most likely, you won’t need to look at these files at all to do this assignment. There may be better explanations available elsewhere, that I am not aware of (you can use them at your own risk, but you don’t need to).

Part I {4 points} “Lies, Damn Lies and Statistics” (author unknown)

We are going to learn how to work with the data so that the statement in the title no longer applies to you, i.e. you’ll understand the data better and catch misinterpretations faster.

Let’s say we want to understand the role of houses and mortgages in balance sheets of the US households – how big are they relative to other items? Here is the full balance sheet chart from the SCF (all variables are available!):

https://www.federalreserve.gov/econres/files/Networth%20Flowchart.pdf

1. Find the following variables in the Summary Extract Public Data file: (a) Total net worth, (b) Total assets, (c) Total debt, (d) Primary residence, (e) Residential property excluding primary residence, (f) Debt secured by primary residence, (g) Debt secured by other residential property, (h) Age, (k) Income, (l) Directly held stocks, (m) Weights (Column C, variable WGT).

2. Define two new variables

a. “housing” = d + e from above

b. “mortgages” = f + g from above (strictly speaking, it’s not only mortgages, but it’s close)

3. {most important step of the whole HW, 0 points } Make sure you understand how to use the weights – they are needed for all calculations below. You will get zero credit if you use the weights incorrectly (because all your numbers will be wrong by far). Basically, the weight is the “size” of each observation, you may interpret it as “how many households of this type are in the US” (up to some scale). Instead of simple averages (aka means), you need weighted averages (aka weighted means). Same for medians. Google it. See more hints at the end of this file.

4. {1 point} Compute the means and medians for all the variables above except for the weights (12 variables!). Make sure you use the weights correctly for both means and medians. For monetary variables, use thousands USD. When means are far from medians, think about why it happens (for fun, not for report).

Below we will compute only the following 3 ratios (but in 7 different ways MWAHAHA):

5. Use your results from part 4 to compute the ratio of mean housing to mean assets, mean mortgages to mean debt, and mean housing to mean net worth. These are useful to understand the composition of assets and debt in the US as a country (but note that it’s only households sector, not firms or govt). E.g. mortgages comprise over 78% of the total US household liabilities, while housing is less than 1/3 of assets.

6. Does the above mean that average mortgage borrower has a mortgage that’s 78% of his/her liabilities? No, because not everyone got a mortgage. Similar critique applies to houses (not everyone got a house). Compute the ratio of mean housing to mean assets among homeowners (housing>0), the ratio of mean mortgages to mean debt among mortgage borrowers (mortgages > 0), the ratio of housing to net worth among homeowners (housing>0). The role of housing did not go up much, but the mortgages are even more important: they comprise almost 86% of debt for mortgage borrowers! But hold on.

7. The means are influenced by outliers; therefore, they may be misleading if we want to think about a typical US household. E.g. say, we have one rich and 9 regular households. The rich one got a $1m house and a $10m stock portfolio, while each regular one has a $200k house and only $10k in stocks. Then housing is less than 22% of assets for all the households combined but it’s over 95% of assets for 9 out of 10 households! Crazy example! But you have not seen the real numbers yet. One way is to talk about “typical” households is to use medians instead of means. Compute the ratios from question 5 using medians instead of means. As expected, we got a larger fraction of housing in assets (more than half!), but the other two numbers look surprising. This is because less than half of households got a mortgage, so the median is zero. The ratio of housing to net worth can well be greater than 100% – this just reflects the fact that many houses carry mortgages, so home equity (that enters the net worth) is less than the value of the house.

8. Combine the approaches in the two previous questions: compute the ratios using medians and restricting the sample as in Q6. The fraction of housing in assets is even higher, while the fraction of mortgages in liabilities is smaller, and the ratio of housing to net worth looks more reasonable.

9. It is also useful to compute the ratios above for individual households and work with them. But for this we need to make sure that the denominator is positive for all observations. So restrict the sample in each case. E.g. for housing / assets use only data on households who have positive assets, for mortgages/debt use only those with positive debt. Compute the means of individual ratios. Think about what it means.

10. Compute the medians of individual ratios. For housing, results are remarkably similar – robustness is good!

11. Another way to think about typical households is to drop non-typical ones and study the rest of the sample. The most common scale for sorting households is income. So let’s pick typical households by income. Restrict the sample to the middle 20% by income (aka the middle quintile), i.e. sort the data by income, and remove the top 40% and the bottom 40% of population (make sure to use the weights here!). Most common mistake: first, remove the bottom 40%, then remove top 40% of the remaining households (that is wrong: you are removing only 0.4*0.6 = 24%, but you need to remove 40%). Compute the ratios of means (as in Q5).

12. {1 point} Finally, we can talk about the distribution of housing vs stocks. Compute the fraction of total housing owned by the top 10% of households (sorted by housing assets, not by income). Compute the fraction of total stock holdings[footnoteRef:1] owned by the top 10% of households (sorted by stock holdings). Finally, compute the fraction of total income earned by the top 10% of households (now sorted by income). This is one of a few popular inequality measures. Note: for each of the 3 measures we use different households, because the sorting is different (though highly correlated: the rich usually got much income, stocks, and housing). [1: For simplicity, we are using only directly held stocks. More precise calculation would account for stocks indirectly held through other financial assets (pension, mutual funds, etc.). For each household, SCF reports the fraction of stocks in each asset type that may contain stocks.]

Part II {6 points} Visualizing the impossible

1. {2 points} SCF allows us to tell a detailed story of (financial) life for a representative US household.

a. This time just keep the households with income between $20k and $200k and assets below $1m – it’s about 70% of the sample[footnoteRef:2], good for representativeness [2: Specifically, you will have 17745 observations that sum up to 69.21% of the sample (using weights, as always)]

b. Split your sample into 12 age groups: below 26 years old, 26-30, 31-35, 36-40, … 71-75, over 75.

We will split all assets and debt into 4 parts each to see how their composition evolves over time.

i. Assets: (1) stocks, (2) non-stock financial assets (fin assets minus stocks), (3) housing, (4) non-housing non-financial assets (non-financial assets minus housing). I.e. we split fin assets and non-fin assets into 2 parts each.

ii. Debt: (1) mortgages & home equity loans on primary residence (NH_MORT variable), (2) other residential debt (HELOC+RESDBT), (3) education loans, (4) all other debt (DEBT minus the above debt categories). I.e. we split the debt into residential and the rest, and then split each of them in 2 parts.

c. Compute the means for the following variables within each group

i. age

ii. homeownership rate (equals 1 if housing is positive and zero otherwise)

iii. net worth

iv. assets (total and 4 parts above)

v. debt (total and 4 parts above)

d. {1 point} Plot the means above (all except mean age) as functions of mean age (total 12 plots!). Bar plot is the most common for this application. Here are a few plots to check your results:

Interpretation: We can see that only 20% of households younger than 25 are homeowners, but homeownership doubles over the next 5 years – this is when most buy their first house (and get their first mortgage). It is hard to get a mortgage earlier because they don’t have money for the down payment (you can see very low net worth). Most of other future homeowners get their residences (and mortgages) during the next 10-15 years. About 80% of households live in their own home when they retire. More than half of net worth accumulation after age 45 is through repaying mortgages.

e. Think about how to interpret other plots: tell a story of a typical US family[footnoteRef:3] to your grandma / cat / dog (do not write it in your report). [3: For more fun (not a part of this HW) you can pick other households, e.g. tell a story of a rich vs poor family – how the evolution of their finances differs. Then it will be hard to say what causes what, but you can try]

f. {1 point} Plot two more variables of your choice (not necessarily related to housing or real estate) and write what we can learn from your plots (roughly 1-2 sentences per plot or 1 paragraph total).

2. {2 points} Now we go in full 3D. Plot the joint distribution of Loan-to-Value and Payment-to-Income, i.e. your vertical assets will be the fraction of households (in percentage; use weights!), and your other two axes are LTV and PTI. For this plot, define Loan = all loans against primary residence, Value = primary residence, Payment = PAYMORT1 (this is just the first mortgage, but ok for simplicity), Income (you already have it). Keep only households with and . Make sure your plot is easy to read (pay special attention to scale and angle of view). Alternatively, you can make a 2D plot (LTV and PTI axes) and show the density by color (aka heatmap). You can submit both plots if you want.

Here are the plots for incomes between 30k and 100k (yours will differ). The 2d one looks more informative.

3.

{2 points: 1 for Lorenz + 1 for Gini} Finally, we go crazy study how concentrated are stocks and housing relative to income and to each other. Do not restrict the sample in any way in this question (use full data that you downloaded). Google about Lorenz Curve and Gini Coefficient[footnoteRef:4]. Plot Lorenz curves for housing, stocks, and income. Put all three curves of different colors on one plot – for easy comparison: housing red, stocks blue, income black (or similar colors) – for easy grading :-)). Add a 45-degree line (black dashed). Compute Gini coefficients. On the right, there is the solution for income (Gini = 0.5977). You will see that stocks are extremely concentrated, while housing is more like income. This means that to understand the stock market, we should study a small fraction of rich people who hold most stocks, while for the housing market we have to study the majority of households. [4: Hint: all you need is to plot the fraction of total housing (stocks, income) owned by a given fraction of households (like in the last question of part 1, but instead of 10% consider many percentages). For the Gini, you can e.g. compute the area below your plot (but not by hand!) and do a simple calculation.]

To submit your work, send the report and all the intermediate files to pavel.krivenko@gmail.com with email topic Bonus HW Fall 2019 followed by your Fist Name and Last Name. For the report, use the template file (this file also contains more answers to check your work). For intermediate files, make sure they are easy to read and understand. If the files are too big for one email, upload them somewhere and send a link. Do not send multiple emails. If you use software that allows to write code, just submit your code that uses the original SCF file (this will be easiest to upload and easiest to check, regardless of the programming language). If you modify SCF file before running the code, submit also the modified file along with a detailed description of what exactly you did and in what order (aka algorithm) so that a person who does not know anything is able to replicate your results.

What questions to ask

You can ask only clarifying questions about what to report. Here is an example.

Question: how many decimal places to keep in answers?

Answer: at least 4 digits total, e.g. you can round 1,234.5678 to 1,2345 and 0.12345678 to 0.1235 (not to 0.12).

What questions not to ask

a) How to access, install or use software (Excel, Matlab, Stata, Python, C++, C, Fortran, Assembly…) or where to find tutorials etc. Most software is available to Baruch students for free (e.g. Matlab). If you have trouble accessing Baruch software license, contact Baruch IT support. Excel is enough but not the most convenient for this HW (e.g. you can do all the means but will have hard time with medians).

b) Questions about SCF. Learning about a database on your own is a part of the HW.

c) Don’t send me your intermediate work to ask why the answers don’t match the sample answers. Use the hints below instead.

Grading

a) There is no partial credit for wrong numbers, but each right number gets credit. For example, in Q4 you need to report 19 numbers (24 total but 5 are given in template). If 10 of them are right, you get 10/19.

b) In Part I, it’s 2/7 point for each question 5-11.

c) Everything else is case by case.

Ethics

a) This is an individual assignment. You are not allowed to work in groups, share your work with others or get help from anyone else.

b) If some part is too similar in the HWs of two students, both get zero credit for the whole assignment (not only for that part). I will run a comparison tool that identifies similarities (changing the format, fonts, colors, variable names or small changes in content will not help).

Hints

a) First, use Google search to learn about weighted means and medians, Lorenz Curve, Gini coefficient, and anything else you don’t understand in this assignment. Try to understand the formulas.

b) Second, try to compute it on your own and compare your answers to the answers I give you in the template. If all the answers match, it’s almost certain that you are doing it right (and vice versa).

c) If you give up on step (b), use Google search to learn how to compute all the above using your particular software (i.e. not just formulas anymore). E.g. search for “weighted mean in Excel” or “weighted median in Matlab”. This step will most likely work, but steps (a) and (b) are better for learning.

d) In some cases, you can find third-party code useful (e.g. weighted median on Matlab file exchange). You are allowed to use it but (1) include a reference to this code in your report (otherwise it’s plagiarism and zero credit), (2) you are using it at your own risk – if the code does not work, you don’t receive credit.

Have fun! 🙂