{"id":387,"date":"2015-08-21T17:43:29","date_gmt":"2015-08-21T21:43:29","guid":{"rendered":"http:\/\/csclub.uwaterloo.ca\/~k55chen\/?p=387"},"modified":"2024-06-24T14:11:35","modified_gmt":"2024-06-24T18:11:35","slug":"how-to-remove-duplicates-gvkey-datadate-when-extracting-firm-quarters-from-compustat-quarterly-fundq","status":"publish","type":"post","link":"https:\/\/www.kaichen.work\/?p=387","title":{"rendered":"How to remove duplicate GVKEY-DATADATE when using Compustat Annual (FUNDA) and Quarterly (FUNDQ) data?"},"content":{"rendered":"<p>The annual data (FUNDA) is easy to deal with; we just need to apply the following conditions:<\/p>\n<p>`indfmt==&#8221;INDL&#8221; &amp; datafmt==&#8221;STD&#8221; &amp; popsrc==&#8221;D&#8221; &amp; consol==&#8221;C&#8221;`<\/p>\n<p>If we have converted FUNDA to Stata format, the uniqueness of <code>GVKEY<\/code>&#8211;<code>DATADATE<\/code> can be verified using the following Stata command:<\/p>\n<p>`duplicates report gvkey datadate if indfmt==&#8221;INDL&#8221; &amp; datafmt==&#8221;STD&#8221; &amp; popsrc==&#8221;D&#8221; &amp; consol==&#8221;C&#8221;`<\/p>\n<p>This command should return &#8220;no duplicates&#8221;.<\/p>\n<p>The quarterly data (FUNDQ) is a bit more complicated. First of all, applying the same conditions won&#8217;t work. In fact, 99.7% observations in FUNDQ already satisfy these conditions. However, duplicate <code>GVKEY<\/code>&#8211;<code>DATADATE<\/code>s still exist in FUNDQ. The root cause of these duplicates is a firm changing its fiscal year-end. I use the following example for illustration:<\/p>\n<p><a href=\"https:\/\/www.kaichen.work\/wp-content\/uploads\/2024\/06\/Screen-Shot-2024-06-20-at-16.32.47-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2234\" src=\"https:\/\/www.kaichen.work\/wp-content\/uploads\/2024\/06\/Screen-Shot-2024-06-20-at-16.32.47-PM.png\" alt=\"\" width=\"1912\" height=\"1080\" srcset=\"https:\/\/www.kaichen.work\/wp-content\/uploads\/2024\/06\/Screen-Shot-2024-06-20-at-16.32.47-PM.png 1912w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2024\/06\/Screen-Shot-2024-06-20-at-16.32.47-PM-300x169.png 300w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2024\/06\/Screen-Shot-2024-06-20-at-16.32.47-PM-1024x578.png 1024w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2024\/06\/Screen-Shot-2024-06-20-at-16.32.47-PM-768x434.png 768w, https:\/\/www.kaichen.work\/wp-content\/uploads\/2024\/06\/Screen-Shot-2024-06-20-at-16.32.47-PM-1536x868.png 1536w\" sizes=\"(max-width: 1912px) 100vw, 1912px\" \/><\/a><\/p>\n<p>Variable definition: <code>FYEARQ<\/code> &#8211; fiscal year; <code>FQTR<\/code> &#8211; fiscal quarter; <code>FYR<\/code> &#8211; fiscal year-end month; <code>DATACQTR<\/code> &#8211; calendar quarter; <code>DATAFQTR<\/code> &#8211; fiscal quarter; <code>ATQ<\/code> &#8211; total assets; <code>NIQ<\/code> &#8211; quarterly net income; <code>NIY<\/code> &#8211; year-to-date net income.<\/p>\n<p>In this example, duplicates exist for three <code>DATADATE<\/code>s: 2010-03-31, 2010-06-30, and 2010-09-30. The data suggest that on March 31, 2010, the firm changed its fiscal year-end from March 31 to December 31 (i.e., FYR changed from 3 to 12). As a result, 2010-03-31 appeared twice in FUNDQ, once as fiscal 2009Q4 (based on the old fiscal year-end) and once as 2010Q1 (based on the new fiscal year-end). FUNDQ also reports additional duplicates for the subsequent two quarters (I don&#8217;t know why). Additionally, if we compare NIQ and NIY as highlighted in the red rectangle, the observation for fiscal 2009Q4 indicates NIY &gt; NIQ, which makes sense as NIY is a four-quarter sum and NIQ is single-quarter net income. In contrast, the observation for fiscal 2010Q1 indicates NIQ = NIY as both are single-quarter net income in this case.<\/p>\n<p>So, what&#8217;s the best strategy to remove duplicate <code>GVKEY<\/code>&#8211;<code>DATADATE<\/code>s?<\/p>\n<p>Before we answer this question, let&#8217;s take a closer look at duplicate <code>GVKEY<\/code>&#8211;<code>DATADATE<\/code>s in FUNDQ, which reveals that 99.8% of <code>GVKEY<\/code>&#8211;<code>DATADATE<\/code>s in FUNDQ are unique as of December 5, 2107. This suggests that no matter how we deal with duplicates, even simply delete all of them, our results probably won&#8217;t change in a noticeable way.<\/p>\n<p>That said, if we want to remove duplicates more carefully, COMPUSTAT provides the following clue:<\/p>\n<p>In the definition of <code>DATAFQTR<\/code>, COMPUSTAT notes that,<\/p>\n<blockquote><p>Note: Companies that undergo a fiscal-year change may have multiple records with the same datadate. Compustat delivers those multiple records with the same datadate but each record relates to a different fiscal year-end period.<\/p>\n<p>Rule: Select records from the co_idesind data group where datafqtr is not null, to view as fiscal data.<\/p><\/blockquote>\n<p>Unfortunately, I find that the suggested rule is not the best strategy because COMPUSTAT seems to set <code>DATAFQTR<\/code> as missing or non-missing inconsistently. In my opinion, the best strategy is to <strong>retain the <code>GVKEY<\/code>&#8211;<code>DATADATE<\/code> that reflects the most recent change of fiscal year-end<\/strong>. This means, in the above example, we should delete the following observations:<\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><code>DATADATE<\/code> = 2010-03-31 and <code>FYR<\/code> = 3<\/li>\n<li><code>DATADATE<\/code> = 2010-06-30 and <code>FYR<\/code> = 3<\/li>\n<li><code>DATADATE<\/code> = 2010-09-30 and <code>FYR<\/code> = 3<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Suppose we have converted FUNDQ to Stata format. The following Stata code will implement the above strategy. The code will also fill in missing <code>DATAFQTR<\/code> and remove duplicate <code>GVKEY<\/code>&#8211;<code>DATAFQTR<\/code>, which will later allow us to use the <code>tsset<\/code> command and perform lag and change calculations in Stata, e.g., to get beginning-of-quarter total assets or calculate quarterly changes in sales. Stata really shines in lag and change calculations for panel data\u2014a superb advantage over SAS.<\/p>\n<pre class=\"lang:default decode:true \">use fundq, clear\r\n\r\n\/\/ Keep necessary variables only for testing\r\nkeep gvkey datadate fyearq fqtr fyr datacqtr datafqtr atq saleq\r\n\r\n\/\/ Generate unique ID for stable sorting (keep original order when tied)\r\ngen _id = _n\r\n\r\n\/\/ Check if gvkey, datadate, and fyr are missing - NO\r\nassert !missing(gvkey, datadate, fyr)  \/\/ manually clean up missing values\r\n\r\n\/\/ Convert datacqtr and datafqtr to datatime\r\ngen _datacqtr = quarterly(datacqtr, \"YQ\")\r\ngen _datafqtr = quarterly(datafqtr, \"YQ\")\r\nformat _datacqtr _datafqtr %tq\r\ndrop datacqtr datafqtr\r\nrename _datacqtr datacqtr\r\nrename _datafqtr datafqtr\r\norder datacqtr datafqtr, after(fyr)\r\n\r\n\/\/ Check if datafqtr is a simple combination of fyearq and fqtr - YES\r\ngen _datafqtr = yq(fyearq, fqtr)\r\ngen _diff = _datafqtr - datafqtr\r\nsum _diff, detail\r\ndrop _datafqtr _diff\r\n\r\n\/\/ Check if datacqtr is simply based on datadate - NO. Jan, Apr, Jul and Oct will be coded as the previous quarter\r\ngen _datacqtr = qofd(datadate)\r\nformat _datacqtr %tq\r\ngen _diff = datacqtr - _datacqtr\r\nsum _diff, detail\r\ndrop _datacqtr _diff\r\n\r\n\/\/ Remove duplicate gvkey-datadate\r\nduplicates tag gvkey datadate, ge(dup)\r\ntab dup\r\nassert dup == 1 | dup == 0  \/\/ if false, manually clean up dup &gt;= 2\r\n\r\nsort gvkey datadate _id\r\n\r\nby gvkey: egen is_fyr_changed = max(dup)\r\ntab is_fyr_changed\r\n\r\nby gvkey: egen max_dup_gvkey_datadate = total(dup)\r\nsum max_dup_gvkey_datadate, detail\r\nlocal iter = r(max) - 2\r\n\r\ngen last=1 if dup==1 &amp; dup[_n+1]==0\r\nreplace last=0 if dup==1 &amp; last==.\r\n \r\nby gvkey: gen newfyr=fyr[_n+1] if last==1\r\nby gvkey: replace newfyr=newfyr[_n+1] if last==0 &amp; last[_n+1]==1\r\nforvalues i = 1\/`iter' {\r\n\tby gvkey: replace newfyr=newfyr[_n+1] if last==0 &amp; last[_n+1]==0\r\n}\r\n\r\ndrop if dup==1 &amp; fyr!=newfyr\r\n\r\nduplicates report gvkey datadate   \/\/ no duplicates\r\n\r\ndrop dup is_fyr_changed max_dup_gvkey_datadate last newfyr\r\n\r\nsave fundq_no_dup_gvkey_datadate, replace\r\n\r\n\/\/ Fill in missing datafqtr\r\n* generate next fiscal year end date\r\ngen _fyearend_1 = lastdayofmonth(mdy(fyr, 1, year(datadate)))\r\ngen _fyearend_2 = lastdayofmonth(mdy(fyr, 1, year(datadate)+1))\r\n\r\ngen fyearend = _fyearend_1 if datadate &lt;= _fyearend_1\r\nreplace fyearend = _fyearend_2 if datadate &gt; _fyearend_1\r\nformat fyearend %td\r\norder fyearend, after(datafqtr)\r\n\r\n* generate _datafqtr based on Compustat Manual\r\ngen _month_diff = mofd(fyearend) - mofd(datadate)\r\n\r\ngen _fyearq = year(fyearend)\r\nreplace _fyearq = year(fyearend) - 1 if month(fyearend) &lt;= 5\r\n\r\ngen _datafqtr = yq(_fyearq, 1) if _month_diff == 9\r\nreplace _datafqtr = yq(_fyearq, 2) if _month_diff == 6\r\nreplace _datafqtr = yq(_fyearq, 3) if _month_diff == 3\r\nreplace _datafqtr = yq(_fyearq, 4) if _month_diff == 0\r\nformat _datafqtr %tq\r\n\r\n* check if datafqtr is correct if datafqtr is not missing - YES\r\ngen _diff = datafqtr - _datafqtr if datafqtr!=.\r\nsum _diff, detail\r\ndrop _diff\r\n\r\n* replace missing datafqtr with computed _datafqtr\r\nreplace datafqtr = _datafqtr if datafqtr == .\r\n\r\n\/\/ Remove duplicate gvkey-datafqtr\r\nduplicates tag gvkey datafqtr, ge(dup)\r\ntab dup\r\n\r\nby gvkey: egen has_dup=max(dup)\r\ntab has_dup\r\n\r\ngsort - _id\r\nduplicates drop gvkey datafqtr, force\r\nsort _id\r\n\r\nduplicates report gvkey datafqtr  \/\/ no duplicates\r\ndrop dup has_dup _*\r\n\r\nreplace fqtr = quarter(dofq(datafqtr)) if fqtr==.\r\nassert !missing(gvkey, datadate, fyr, datafqtr, fqtr)  \/\/ manually investigate missing values\r\n\r\nsave fundq_no_dup_gvkey_datafqtr, replace<\/pre>\n<p>Please note: I also agree with one of the readers&#8217; comments that &#8220;(how to remove duplicates) depends on what you need&#8221;. For example, in one of my projects, I want to examine three-day CAR around earnings announcement date (<code>RDQ<\/code>) and use total assets as the deflator in my regression. As a result, when duplicate <code>GVKEY<\/code>&#8211;<code>DATADATE<\/code>s exist, the one with non-missing <code>RDQ<\/code> and <code>ATQ<\/code> will be preferred if I want to retain as many observations as possible.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The annual data (FUNDA) is easy to deal with; we just need to apply the following conditions: `indfmt==&#8221;INDL&#8221; &amp; datafmt==&#8221;STD&#8221; &amp; popsrc==&#8221;D&#8221; &amp; consol==&#8221;C&#8221;` If we have converted FUNDA to Stata format, the uniqueness of GVKEY&#8211;DATADATE can be verified using &hellip; <a href=\"https:\/\/www.kaichen.work\/?p=387\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11,9],"tags":[16,17],"_links":{"self":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/387"}],"collection":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=387"}],"version-history":[{"count":68,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/387\/revisions"}],"predecessor-version":[{"id":2254,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=\/wp\/v2\/posts\/387\/revisions\/2254"}],"wp:attachment":[{"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=387"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=387"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kaichen.work\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=387"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}