How to remove duplicate GVKEY-DATADATE when using Compustat Annual (FUNDA) and Quarterly (FUNDQ)?

The annual data (FUNDA) is easy to deal with, we just need to apply the following conditions:

indfmt=="INDL" & datafmt=="STD" & popsrc=="D" & consol=="C"

If we have converted FUNDA to Stata format, the uniqueness of GVKEY-DATADATE can be verified using the following Stata command:

duplicates report gvkey datadate if indfmt=="INDL" & datafmt=="STD" & popsrc=="D" & consol=="C"

This command will return “no duplicates”.

The quarterly data (FUNDQ) is a little bit complicated. First of all, applying the same conditions won’t work. In fact, 99.7% observations in FUNDQ have already satisfied these conditions. But duplicate GVKEY-DATADATE pairs still exist in FUNDQ. The root cause of duplicate GVKEY-DATADATE pairs is a firm changing its fiscal year end. I use the following example as an illustration:

Variable definition: FYEARQ – fiscal year; FQTR – fiscal quarter; FYR – fiscal year end month; DATACQTR – calendar quarter; DATAFQTR – fiscal quarter; ATQ – total assets; NIQ – quarterly net income; NIY – year-to-date net income.

In this example, duplicates exist for three DATATEs: 2010-03-31, 2010-06-30, and 2010-09-30. The data suggest that on March 31, 2010, the firm changed its fiscal year end from March 31 to December 31 (i.e., FYR changed from 3 to 12). As a result, 2010-03-31 appeared twice in FUNDQ, one is fiscal 2009Q4 (based on the old fiscal year end) and the other is 2010Q1 (based on the new fiscal year end). FUNDQ also reports additional duplicates for the subsequent two quarters (I don’t know why). Next, if we compare NIQ and NIY as highlighted in the red rectangle, the observation for fiscal 2009Q4 indicates NIY > NIQ, which makes sense because NIY is four-quarter sum and NIQ is single-quarter net income. In contrast, the observation for fiscal 2010Q1 indicates NIQ = NIY, because both are single-quarter net income in this case.

So, what’s the best strategy to remove duplicate GVKEY-DATADATE pairs?

Before we answer this question, let’s take a closer look at duplicate GVKEY-DATADATE pairs in FUNDQ, which reveals that 99.8% of GVKEY-DATADATE pairs in FUNDQ are unique as of December 5, 2107. This suggests that no matter how we deal with duplicates, even simply delete all of them, our results probably won’t change in a noticeable way.

That said, if we want to remove duplicates in a more careful way, COMPUSTAT gives the following clue:

In the definition of DATAFQTR, COMPUSTAT notes that,

Note: Companies that undergo a fiscal-year change may have multiple records with the same datadate. Compustat delivers those multiple records with the same datadate but each record relates to a different fiscal year-end period.

Rule: Select records from the co_idesind data group where datafqtr is not null, to view as fiscal data.

Unfortunately, I find that the suggested rule is not the best strategy, because COMPUSTAT seems to set DATAFQTR as missing or non-missing in an inconsistent way. In my opinion, the best strategy is to retain the GVKEY-DATADATE that has reflected the most recent change of fiscal year end. This means, in the above example, we should delete the following observations:

    • DATADATE = 2010-03-31 and FYR = 3
    • DATADATE = 2010-06-30 and FYR = 3
    • DATADATE = 2010-09-30 and FYR = 3

Suppose we have converted FUNDQ raw file to Stata format, the following Stata codes will implement the above strategy and save the results in FUNDQ_NODUP. By the way, these are the cleverest Stata codes I’ve ever written 🙂 The codes also create a new variable FYQ (or CYQ) that represents fiscal year and quarter (or calendar year and quarter). The codes not only remove duplicate GVKEY-DATADATE pairs but also remove duplicate GVKEY-FYQ (or GVKEY-CYQ) pairs, which will then allow us to use the tsset command and do lag and change calculations in Stata, e.g., to get beginning-of-quarter total assets or calculate seasonal changes in sales. Stata really shines in lag and change calculations for panel data—a superb advantage over SAS.

Please note: I also agree with one of the readers’ comments that “(how to remove duplicates) depends on what you need”. For example, in one of my projects, I want to look at three-day CAR around earnings announcement date (RDQ) and use total assets as the deflator in my regression. As a result, when duplicate GVKEY-DATADATE pairs occur, the one with non-missing RDQ and ATQ will be preferred if I want to retain as many observations as possible. The following Stata commands will serve the purpose well:


This entry was posted in Data, Stata. Bookmark the permalink.

18 Responses to How to remove duplicate GVKEY-DATADATE when using Compustat Annual (FUNDA) and Quarterly (FUNDQ)?

  1. YANLEI ZHANG says:

    Hi, Kai, I think it depends on what you need, it could be either calendar quarter or fiscal quarter. I usually try to collect the information for each calendar quarter, in that case, I only keep the recent convention for the duplicated observations.

    • Kai Chen says:

      Hi Yanlei, thanks for your comments. I agree—it depends on the need. Many ways to do this, and perhaps it doesn’t matter which way to use since duplicate observations are so tiny percentage. But I do find Compustat hasn’t created datacqtr and datafqtr in a consistent way (missing or non-missing), which makes the two variables unreliable. But it would be fine if you create the two quarter variables by yourself from datadate, fyearq, and fqtr. Another thought is fiscal quarter cannot be ignored, particularly if you use year-to-date variables. For example, NIQ and NIY are quarterly and year-to-date net income. If you use NIY to calculate quarterly net income, fiscal Q1 NIY should equal NIQ; but for other quarters, you need take difference between NIY[t] and NIY[t-1] for quarter [t] number. Thus, fiscal quarter indicates the need to compute the difference or not. This may not be a problem since Compustat provides both NIQ and NIY and you can simply use NIQ. But for other variables like cash flows, Compustat only provides year-to-date data. In that case, fiscal quarter needs to be considered.

  2. Thanh says:

    Hi Kai. I found that there is a difference in the number of observations every download from Compustat in the same period. For example, I downloaded 10 variables from Compustat in 1993-2016 with 380,0000 observations. Then I need another variable and download this variable separately from Compustat. However, I found that the number of observations are different in 2 files regardless of the same period. After that, I merge two files. After merging, some observations are not matched in 2 files. Can you help me explain why the number of observations in 2 files are different for the same period in Compustat?

    • Kai Chen says:

      This happened because the data vendor constantly updated Compustat and they might update past data when new information came in. The best strategy in empirical work is to download a snapshot of full Compustat and use that snapshot throughout.

  3. Vincent Zhang says:

    Hi, Kai. Thanks for sharing! I am wondering whether you notice that there are firms report financial data in Canadian dollars in COMPUSTAT (“curcd” = CAD). However, I barely saw papers mentioned that how to deal with it. What’s your opinion? Should we exclude the firms or exchange them to US dollars?

    • Kai Chen says:

      Some researchers will apply additional filter: retain firms incorporated in USA (Compustat contains a variable for that). This should roughly solve your question. Of course, if currency matters, you should use proper exchange rate if you want to retain Canadian firms as well. But because most variables we calculate are ratios (e.g., scaled by total assets), currency may not matter anyway.

  4. Sujesh says:

    Hi Kai, the WRDS rule mentioned above mentions datacqtr (calendar quarter) and not datafqtr, if I am not mistaken. If all that you are interested in is exact duplicates, then the above method for quarterly data is fine. But there will still remain an issue of overlapping months, when firms change month of financial year end. In the strict sense, this too could be treated as a duplicate observation. For example, if in the year 2002 a firm changed financial year end from May to June, Compustat will capture both in subsequent rows. The May observation and the June observation will have 2 overlapping months (May and April). To remove all such cases of overlap, you will need to consider the observations where the datacqtr is a null/ blank. Doing so will subsume all cases of exactly duplicate observations as well. But just using the ‘datacqtr is a null’ criteria will not help uniquely identify exact duplicates. This is what I found out based on a subset of data, though I might have missed some additional possibilities.

    • Kai Chen says:

      Thanks for your comments. I think COMPUSTAT rule is saying “retain observations with non-missing datafqtr (i.e., fiscal quarter instead of calendar quarter). However, even if COMPUSTAT set a rule on how to create datafqtr and datacqtr), I find they hasn’t applied that rule in a consistent way. That’s why I think their tip is not reliable. I don’t think overlapping months in your example is a problem—as long as the observation reports a quarterly number (three-month), I’m fine with it. I recently updated the strategy to remove duplicates, which may provide new perspectives. Thanks again for your inputs.

  5. Alberto says:

    Thank you for you post! I was wondering what is the date variable in quarterly compustat if I want to merge it to monthly CRSP? “datadate” or “datacqtr”?

  6. Yuchen says:

    Hi chen. I always benefit from reading your posts. They are informative and practical. One simple question: when i extract the annual data from Compustata (Stata), I will get more 500,000+ observations for the period from 1950 to 2010. Even if I restrict my sample firms to those that I could identifiy them through CCM table, the firm-years are about 300,000. However, manny papers use much smaller sample size. I would like to know what restrictions you may impose on the sample selection?

  7. Francois Durant says:

    Very useful post Kai, thanks!

    It would be even more helpful to extend it up to the point where you actually tsset the dataset in quarters. It turns out that you need to generate a quarterly date from datadate and this new variable, used to tsset, also has duplicates. How would you manage this situation / optimally drop these duplicates?

  8. Jacob says:

    Hi Kai,

    I recently downloaded it and found that for annual data, the condition “indfmt==”INDL” & datafmt==”STD” & popsrc==”D” & consol==”C”” does not prevent duplicates now. It used to work, but now there are still several duplicates. When I set the CURCD == USD, there are no duplicates any more. But the sample of course is smaller.

  9. Wen Lin says:

    Hi Kai,

    Thanks for your post.

    I still have one question in terms of repeated GVKEY-FYEAR. Since my research needs to use the GVKEY and FYEAR to merge with another dataset. Even though I have the unique GVKEY-DATADATE, I could identify repeated GVKEY-FYEAR. Can I keep the GVKEY-FYEAR with the most recent DATADATE? Many thanks.

  10. PW says:

    Hi Kai,

    Thanks for your post.

    I don’t know if my understanding is wrong. But the codes ‘by gvkey: ge newfyr=fyr[_n+1] if last==1’ will drop the obs with duplicates happen in the final year of each gvkey. As the fyr[_n+1] will be none (.). In this case, newfyr == .

  11. Chen says:

    Hi Kai, thank you very much for this post. Could you please let me know what the “FS” and “INDL” means in Compustat? I searched for it but found no answer

    • Kai Chen says:

      From Computstat data dictionary: This code indicates whether a company reports in a Financial Services or Industrial format. Industry Format (INDFMT) describes the general industry presentation for the associated data record. This allows you to view a company (such as Aetna) as an industrial company or as a financial serves company with all the associated data items, thus making it easier to dissect a company.

Leave a Reply

Your email address will not be published. Required fields are marked *