Unlocking the black box: how to calculate book-to-market?

The book-to-market ratio (BM) is one of the most frequently calculated variables in accounting research and is usually used as a control variable in regressions. However, how to calculate it is a question I often asked when I started my PhD, as nobody defined it clearly in their papers. Should I use CEQ or SEQ, assets minus liabilities, or something else? This question becomes even more confusing when I realize that finance researchers often define the book value of equity differently from accounting researchers.

A related question is how to verify the basic accounting equation (A = L + E) using Compustat Annual (FUNDA) or Quarterly (FUNDQ) data? To answer this question, I checked the Compustat Manuals – Balancing Models – North American Company Data:

Therefore, the basic accounting equation is reflected in the following relationship:


Or AT = LT + MIB + TEQ

The first equation can be verified using FUNDA data. However, the second equation holds true for only 42% of observations. Further investigation indicates that prior to FASB 160, LSE = LT + MIB + SEQ, while after FASB 160, LSE = LT + MIB + TEQ. That’s why the above second equation won’t hold true prior to FASB 160.

Returning to the initial question, BM = Book Value of Equity (BVE) / Market Value of Equity (MVE). When calculating MVE, we appear only able to use MVE = csho × prcc_f, where csho is the number of common shares outstanding, and prcc_f is the share price at the fiscal year-end. Therefore, MV will be the market value of common shares. To match this, it seems appropriate to define BVE as ceq(also common shares), and thus BM = ceq / (csho × prcc_f).

In contrast, in the sample code provided by WRDS to replicate Fama-Frech’s three factors, as well as in the Financial Ratios macro provided by WRDS, BVE = seq + txditcpstk. The definitions of these variables can be found in the above figure. Although seqpstk = ceq, I have no clue why txditc should be added. Perhaps finance researchers can shed light on this.

Posted in Learning Resources | Tagged | Leave a comment

Display full variable names in State Results window

In State, the default width for displaying variable names in the Results window is set to 12. Sometimes it can be frustrating not to see the full variable names. Consider the following results from the tabstat command:

Ideally, we’d prefer to display the full variable names as shown below:

Unfortunately, there isn’t a global setting to adjust the display width universally. Instead, we must change the default width within each specific command, and some commands even don’t provide an option for this purpose.

Below, I’ve listed the options to control display width for some commonly used commands:

tabstat ..., va(#), 8 <= # <= 32

mdesc ..., ab(#)

sum – no options for adjusting the display width

regress – no options for this, but you can view the full variable names by widening the Results window.

To see each command’s available options, type help [command].

Posted in Stata | Leave a comment

How to prevent the “log file already open” error when working with log files in Stata

Sometimes we use the log using filename, replace command at the beginning of a Do file to open a log file and record everything displayed in the Results window. However, if an error interrupts the previous execution of the Do file, re-running it after debugging may fail simply because we forget to manually close the previous log file. In such cases, Stata will return the error message “log file already open” in the Results window.

A simple trick to prevent the “log file already open” error is to include the capture log close command before log using filename, replace:

This structure ensures that any previously open log files are closed before attempting to create or replace the log file, thereby maintaining a clean and error-free workflow when working with log files in Stata.

Posted in Stata | Tagged | Leave a comment

Stata command to draw a graph showing the average of a variable by group

Suppose we have the following dataset that provides the life expectancy (lexp) of each country in each region:

We can draw a bar graph to show the average (or other statistics) life expectancy of each region using two different methods:

Method 1: using the collapse command

Method 2: using the egen tag command

Method 1 will actually destroy the original data. If we need to keep the original data available for use afterward, Method 2 may be more convenient.

Posted in Stata | Tagged | 2 Comments

Stata command to rename groups of variables

Swap variable names: rename (v1 v2) (v2 v1)

Lowercase (or uppercase, or propercase) groups of variable names: rename V1 V2, lower or rename *, lower

Add suffix _old to variables v1, v2, … for one or more digits: rename v# =_old

Remove suffix _old from all variables ending in _old: rename *_old *

The meaning of each specifier:


Posted in Stata | Tagged | Leave a comment

Link FactSet and CRSP

Both FactSet and CRSP offer identifier files that contain primary identifiers at the entity level and security level (note: an entity may issue multiple securities). These files provide a mapping between their primary identifiers and all other historical identifiers such as CUSIP and ticker. By utilizing these identifiers, we can construct a link table.

Relevant CRSP Variables

The CRSP STOCKNAMES file provides a mapping between CRSP permanent identifiers and all historical CUSIPs, company names, and exchange tickers, along with their effective date ranges.

  • PERMCO is the unique identifier for CRSP entities. PERMCO remains constant over time, even if the company changes its name, CUSIP, or exchange ticker.
  • PERMNO is the unique identifier for CRSP securities. PERMNO also does not change over time too.
  • COMNAM represents the company name in CRSP at a specific point in time.
  • NCUSIP is the 8-digit historical CUSIP for CRSP securities. A change in CUSIP can occur due to various reasons, including non-fundamental events like splits and company name changes. CRSP also has a header variable named CUSIP that reflects only the most recent CUSIP for the entire time series.
  • NAMEDT and NAMEENDDT represent the first and last effective dates of each name structure, which includes CUSIP, company name, exchange code, exchange ticker, share class, and SIC code. Each PERMNO has at least one name structure in the STOCKNAMES file.

Relevant FactSet Variables

The FactSet SYM_V1_SYM_CUSIP_HIST file provides a mapping between FactSet permanent identifiers and all historical CUSIPs, along with their effective date ranges.

  • FSYM_ID is the unique identifier for FactSet securities (similar to PERMNO in CRSP). FSYM_ID remains unchanged over time. In another identifier file, ENT_V1_ENT_SCR_SEC_ENTITY_HIST, FactSet provides a mapping between FactSet securities and their issuing entities, uniquely identified by FACTSET_ENTITY_ID, which, like PERMCO in CRSP, does not change over time.
  • CUSIP is the 9-digit historical CUSIP for FactSet securities. FactSet also has a header variable named MOST_RECENT that reflects only the most recent CUSIP for the entire time series.
  • START_DATE and END_DATE represent the first and last effective dates of each name structure.

Linking FactSet FSYM_ID and CRSP PERMNO using CUSIP

Both FactSet FSYM_ID and CRSP PERMNO are used to uniquely identify securities. We can link the two using the associated historical CUSIPs, which serve as a common identifier in both FactSet and CRSP.

The following two-part code generates a link table that provides a mapping between FactSet FSYM_ID and CRSP PERMNO, along with their effective date ranges (START_DATE and END_DATE). The first part uses SAS, and the second part uses Stata. An SAS expert should be able to complete the task entirely in SAS. I use Stata for the second part simply because a handy command (tsspell) is available in Stata.

First-part SAS code:

Second-part Stata code:

Please set up the LIBNAME and directories in SAS and Stata to properly reference the relevant datasets. In Stata, use the command ssc install tsspell, replace to install the third-party command.

Please note that in the link table, the END_DATE column represents the most recent update date of the CRSP STOCKNAMES file (rather than a missing value) if the link is still active. Therefore, it is important to use the most recent CRSP STOCKNAMES file to avoid any unexpected error.

I find that the link table only covers a small portion of the FactSet universe, as CRSP only keeps historical CUSIP data for securities traded in major exchanges, while FactSet extracts data from a much large security universe. Specifically, only 36,560 FSYM_IDs can be linked to a PERMNO, whereas there are 310,234 FSYM_IDs in FactSet with FACTSET_ENTITY_IDs.

I would like to express my gratitude to two WRDS articles: Merging CRSP and COMPUSTAT Data and Linking IBES and CRSP Data for their valuable insights.


Posted in Data | Tagged | 7 Comments

A test on Stata running speed on MacBook Pro (M1 Pro chip) and old Macs

I ran a test on Stata running speed on my newest MacBook Pro (14-inch, 2021) and two old Macs—iMac (27-inch, 2019) and MacBook Pro (16-inch, 2019).

Technical specifications:

  • MacBook Pro (14-inch, 2021): CPU Apple M1 Pro (10-core), memory 16G, SSD 1T
  • iMac (27-inch, 2019): CPU Intel i9-9900K (3.6GHz 8-core), memory 64G, SSD 1T
  • MackBook Pro (16-inch, 2019): CPU Intel i9-9880H (2.3 GHz 8-core), memory 16G, SSD 1T

The Stata code I tested includes some simple commands only, such as loop, egen, and merge. As I’m using Stata SE, the single-core performance of the three Macs should be the most important determinant.

MacBook Pro (2021) took 65.71 seconds to complete the test, while iMac (2019) took 96.59 seconds and MacBook Pro (2019) 117.16 seconds.

I’m surprised to see this result. The M1 Pro chip seems a beast!


Posted in Stata | 6 Comments

My thoughts on Python for accounting research

There is a temptation for accounting PhD students to invest in learning Python. However, I would recommend that accounting PhD students focus more on SAS + Stata than on Python in their first year for a few practical and technical reasons:

  • While Python may seem trendy, its user base in accounting research is still relatively small. Research often involves co-authorship. If the research code can only be understood by one person in a research team, it can be counterproductive.
  • Due to the limited usage of Python among researchers, Python users cannot take full advantage of the many ready-to-use code written in SAS, which has a larger user base.
  • Debugging in Python is more challenging compared to commercial software like SAS and Stata. While freeware may sound appealing, I have always found that I end up spending more time when using freeware, as it typically lacks detailed and excellent help documentation.
  • Similar to Stata, Python currently has limitations in manipulating large datasets, as it is constrained by the the computer’s memory size. If you anticipate working with mega datasets such as intra-day bond/stock transaction data, using Python can be a pain.

SAS + Stata is a solid solution that I advocate for. I use SAS for SQL and Stata for all other tasks. Python is well-suited for specific research topics such as textual analysis. If you decide to invest in learning Python, you can rely on the following learning resources:

  • The Python Tutorial on the official Python website (https://docs.python.org/3.10/tutorial/index.html) is an easy and quick read.
  • There are numerous Python courses available on Udemy (https://www.udemy.com), often offered at discounted prices of $13 per course.
  • An excellent methodology article authored by Vic Anand, Khrystyna Bochkay, Roman Chychyla and Andrew Leone (2020), “Using Python for Text Analysis in Accounting Research”, Foundations and Trends® in Accounting: Vol. 14: No. 3–4, pp 128–359. https://dx.doi.org/10.1561/1400000062.
Posted in Learning Resources, Python | Tagged , | 1 Comment

Stata command to perform propensity score matching (PSM)

Most propensity score matching (PSM) examples typically use cross-sectional data rather than panel data. However, in accounting research, panel data (observations with two subscripts i and t, e.g., firm-years) are often used in a difference-in-differences (DID) research design. This involves two dummy variables, TREATMENT and POST , in the following regression:


where TREATMENT indicates a treatment event and POST indicates before or after that event. In this context, it is common to perform one-to-one matching using selected pre-event and firm-level variables (Xs). These pre-event variables can be measured either at the most recent date before the event (e.g., total assets at the most recent quarter end before the event) or as an average over the pre-event period (e.g., average total assets in the four quarters preceding the event).

To conduct PSM, a probit or logit regression is needed:

TREATMENT = X1 + X2 + …

The single nearest neighbour based on propensity score is selected as the matched control observation. The treatment observations and their respective matched control observations then form the sample for subsequent DID regressions.

In Stata, the third-party module psmatch2 is commonly used to find matched control observations using PSM. To install the module, the following command can be used:

ssc install psmatch2

Once installed, the following command is typically used:

psmatch2 TREATMENT X1 X2 ..., [noreplacement logit descending]

There are three options in the above command:

  • noreplacement – Perform one-to-one matching without replacement. I would add this option if I want to find more unique matches.
  • logit – Uses logit instead of the default probit regression to estimate the propensity score. I would be indifference between using logit and probit.
  • descending – More details about this option can be found in Lunt (2014). The author concludes that “in the absence of a caliper (another option that I would omit to maximize the number of matches), the descending method provides the best matches, particularly when there is a large separation between exposed (treated) and unexposed (untreated) subjects.” Therefore, I would add this option.

psmatch2 creates several variables, with _id and _n1 being the most useful for subsequent DID regressions:

  • _id is a new identifier created for all observations in the case of one-to-one and nearest-neighbors matching.
  • _n1 stores the new identifier (_id) of the matched control observation for every treatment observation.

There is one limitation with psmatch2. Sometimes, we may want the treatment and its matched control to have the same value on a variable X. For example, we may want the treatment and its matched control to be drawn from the same industry, or both to be male or female. psmatch2 lacks a direct solution for this requirement. Some imperfect workarounds, such as adding i.industry or i.gender in Xs, are discussed in this post. In contrast, the PSMATCH procedure in SAS provides a perfect solution by offering the EXACT= statement. I am not sure if SAS achieves this by implementing a stratification method, but if it does, it is possible that psmatch2 in Stata could achieve similar results by tweaking its options. More details on the PSMATCH procedure in SAS can be found in this manual.

It is worth noting that that psmatch2 is preferable to Stata’s built-in command teffects because the variables generated by psmatch2 (particularly _id and _n1) are necessary for subsequent DID regressions, whereas teffects does not return such variables.

This article aims to provide a quick how-to and may omit some necessary steps for PSM, such as assessing covariate balance. A more rigorous discussion on PSM in accounting research can be found in Shipman, Swanquist, and Whited (2017).

I would like to express my gratitude to the authors of the following articles that have been beneficial in preparing this post:

Posted in Stata | Tagged | 10 Comments

Clean up TRACE Enhanced dataset

WRDS provides an excellent manual (link) and SAS code (link) for cleaning up the raw TRACE Enhanced bond transaction data, primarily based on the work done by Dick‐Nielsen, Jens, How to Clean Enhanced TRACE Data (December 3, 2014). Available at SSRN: https://ssrn.com/abstract=2337908. Dick‐Nielsen also provides his SAS code for the clean-up. Several papers refer to his cleaning steps.

Both WRDS and Dick-Nielsen’s codes remove cancellations, corrections, reversals, and double counting of agency trades. Dick-Nielsen’s code provides a few more options, e.g., remove commissioned trades.

Posted in SAS | Leave a comment