Scientists renamed 27 human genes in 2020 because Microsoft Excel was automatically converting their symbols into dates, corrupting data in supplementary files and published analyses. The HUGO Gene Nomenclature Committee (HGNC) changed symbols that looked like dates, such as SEPT1 and MARCH1, to forms that Excel would not auto-format, for example SEPTIN1 and MARCHF1. The goal was to stop recurrent, hard-to-detect errors introduced whenever gene lists were typed, pasted, or imported into spreadsheets.
What is the Excel gene name problem?
The problem arises when gene symbols resemble dates or scientific notation. When those strings appear in spreadsheets or CSV files, Excel often reinterprets them as dates or numbers. That silently changes the underlying values, so a gene named SEPT1 can become “Sep-01,” and a code like 11E10 can become 1.10E+11.
A 2016 analysis of genomics papers found that about one-fifth of papers with supplementary Excel gene lists contained gene name errors caused by automatic conversion to dates or numbers (Genome Biology).
Because gene symbols are widely used as row or column labels in gene expression matrices and supplementary datasets, even a few auto-conversions can propagate through pipelines, mislabel features, and distort downstream analyses and meta-analyses.
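As a rough illustration of which labels are at risk, the sketch below flags symbols that resemble a month followed by a day, or digits around an "E". The function name `excel_risk` and both patterns are assumptions made for this example; Excel's actual parsing depends on locale and version and accepts more forms than these patterns cover.

```python
import re
from typing import Optional

# Month prefixes chosen for illustration only; this is not Excel's real rule set.
_MONTHS = "JAN|FEB|MAR|MARCH|APR|APRIL|MAY|JUN|JUNE|JUL|JULY|AUG|SEP|SEPT|OCT|NOV|DEC"
_DATE_LIKE = re.compile(rf"^(?:{_MONTHS})-?\d{{1,2}}$", re.IGNORECASE)
# Digits, an E, then digits, e.g. 11E10, which can be read as scientific notation.
_SCI_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def excel_risk(symbol: str) -> Optional[str]:
    """Return 'date' or 'number' if the symbol resembles one, else None."""
    s = symbol.strip()
    if _DATE_LIKE.match(s):
        return "date"
    if _SCI_LIKE.match(s):
        return "number"
    return None


if __name__ == "__main__":
    for sym in ["SEPT1", "MARCH1", "11E10", "SEPTIN1", "MARCHF1", "TP53"]:
        print(sym, excel_risk(sym))
```

A check like this is best used as a warning before a gene list is shared, not as proof that a label is safe.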
How does Excel convert gene symbols into dates?
Excel tries to infer data types when you type values, paste content, or open delimited files. If a string matches a date pattern, Excel converts it into a date serial number and displays a formatted date. This occurs both on direct entry and when opening CSV/TSV files unless the column is explicitly set to Text during import. Similar behavior affects long identifiers, phone numbers, postal codes, and product codes that can be turned into scientific notation or lose leading zeros.
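To make the "date serial number" concrete, here is a minimal sketch of the arithmetic, assuming Excel's default 1900 date system, in which 1900-01-01 is serial 1 and 1900 is treated as a leap year. For dates from 1900-03-01 onward that is equivalent to counting days from 1899-12-30. The helper names are hypothetical, and the year in the example is an assumption: when a typed value carries no year, Excel typically fills in the current one.

```python
from datetime import date, timedelta

# Counting from 1899-12-30 reproduces Excel's 1900-system serials for
# dates on or after 1900-03-01 (it absorbs Excel's fictitious 1900-02-29).
_EPOCH = date(1899, 12, 30)
_FIRST_SAFE_DATE = date(1900, 3, 1)
_FIRST_SAFE_SERIAL = 61  # serial of 1900-03-01


def excel_serial(d: date) -> int:
    """Serial number Excel would store for date d (1900 date system)."""
    if d < _FIRST_SAFE_DATE:
        raise ValueError("dates before 1900-03-01 need special handling")
    return (d - _EPOCH).days


def serial_to_date(serial: int) -> date:
    """Inverse of excel_serial for serials from 1900-03-01 onward."""
    if serial < _FIRST_SAFE_SERIAL:
        raise ValueError("serials below 61 need special handling")
    return _EPOCH + timedelta(days=serial)


if __name__ == "__main__":
    # If SEPT1 were read as 1 September 2020, the cell would hold this number.
    print(excel_serial(date(2020, 9, 1)))  # 44075
    print(serial_to_date(44075))           # 2020-09-01
```

This is why a converted cell cannot simply be reformatted back to text: the stored value is the number, and the original string is gone.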
Microsoft’s documentation states there is no global setting to disable the automatic date-conversion behavior, though there are workarounds such as preformatting as Text or prefixing with an apostrophe (Microsoft Support).
The upshot: if a collaborator opens a clean CSV in Excel with default settings, date-like gene symbols can be silently altered, saved, and redistributed with errors.
What changed in 2020 and which genes were affected?
To eliminate a persistent, systemic source of error, the HGNC updated 27 official human gene symbols in 2020 to avoid date-like strings. Two prominent sets were affected:
- SEPTIN family: symbols like SEPT1, SEPT2, … were standardized to SEPTIN1, SEPTIN2, … to avoid “Sep-01,” “Sep-02,” etc.
- MARCH family: symbols like MARCH1, MARCH2, … became MARCHF1, MARCHF2, … to avoid “Mar-01,” “Mar-02,” etc.
Several other symbols that resembled months or date patterns were also adjusted. The HGNC’s broader guidelines now discourage new symbols that can be misread by common software. You can consult current, authoritative symbols and approved aliases in the HGNC database and guidelines (HGNC).
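For legacy files, the two family-level renamings described above can be sketched as a pattern-based update. This assumes the numeric suffix carries over unchanged, as the examples in the text suggest; the function name `update_symbol` is hypothetical, and any real mapping should be checked against the HGNC database, which also covers the other renamed symbols that this sketch ignores.

```python
import re

# Family-level patterns taken from the examples in the text.
# Assumption: SEPT<n> maps to SEPTIN<n> and MARCH<n> maps to MARCHF<n>.
_RENAMES = [
    (re.compile(r"^SEPT(\d+)$"), r"SEPTIN\1"),
    (re.compile(r"^MARCH(\d+)$"), r"MARCHF\1"),
]


def update_symbol(symbol: str) -> str:
    """Return the post-2020 form of a legacy symbol, or the input unchanged."""
    for pattern, replacement in _RENAMES:
        if pattern.match(symbol):
            return pattern.sub(replacement, symbol)
    return symbol


if __name__ == "__main__":
    legacy = ["SEPT1", "MARCH2", "TP53", "SEPTIN1"]
    print([update_symbol(s) for s in legacy])
```

Symbols that are already current, or that belong to neither family, pass through untouched, so the function can be applied to a whole column.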
How can researchers prevent Excel gene name errors?
Practical steps can reduce or eliminate the risk of auto-conversion:
- Prefer stable identifiers over symbols. Use Ensembl stable IDs (for example, ENSG… for genes) in analysis files, and map to human-readable symbols only at the presentation stage (Ensembl: Stable IDs).
- Avoid using Excel for raw or intermediate data. Process data in R, Python/pandas, or command-line tools, and save tab-delimited or CSV files that are not routinely opened in Excel.
- If you must use Excel, import safely. Open a blank workbook, use Data > Get Data or the Text Import Wizard, and set the gene symbol column’s type to Text. Alternatively, preformat the destination column as Text before pasting.
- Force text where needed. Prefix inputs with an apostrophe (') or, for fractions, type a zero and a space first (for example, 0 1/2), as documented by Microsoft (Microsoft Support).
- Validate on read. When ingesting shared files, scan for date serial numbers or unexpected date formats in gene columns. Flag and correct any conversions before analysis.
- Document corrections. Note any fixes to spreadsheet-induced errors in methods or data readme files to aid reproducibility.
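The "validate on read" step above could look something like the following sketch, which scans one column of a CSV file for values that resemble converted dates, date serial numbers, or scientific notation. The function name, the column name in the example, and the specific patterns are assumptions for illustration. The five-digit serial check in particular will produce false positives if the column legitimately holds numeric identifiers, so it only makes sense for a column that should contain gene symbols.

```python
import csv
import io
import re

_MON = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# (label, pattern) pairs; the first match wins. Illustrative, not exhaustive.
_SUSPECT = [
    ("date", re.compile(rf"^\d{{1,2}}-(?:{_MON})$", re.IGNORECASE)),  # 1-Sep
    ("date", re.compile(rf"^(?:{_MON})-\d{{2}}$", re.IGNORECASE)),    # Sep-01
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),                     # 2020-09-01
    ("date", re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),               # 9/1/2020
    ("serial", re.compile(r"^\d{5}$")),                               # 44075
    ("number", re.compile(r"^\d\.\d+E\+\d+$", re.IGNORECASE)),        # 1.10E+11
]


def find_suspect_labels(handle, column):
    """Return (line_number, value, kind) for each suspicious label in column."""
    reader = csv.DictReader(handle)
    findings = []
    for row in reader:
        value = (row.get(column) or "").strip()
        for kind, pattern in _SUSPECT:
            if pattern.match(value):
                # reader.line_num is the source line just consumed.
                findings.append((reader.line_num, value, kind))
                break
    return findings


if __name__ == "__main__":
    sample = "gene,count\nTP53,10\n1-Sep,4\nSEPTIN1,7\n1.10E+11,3\n44075,2\n"
    for finding in find_suspect_labels(io.StringIO(sample), "gene"):
        print(finding)
```

Python's csv module returns every field as a string, so reading the file this way does not itself trigger any type conversion. Flagged rows still need manual review, because a converted date cannot always be mapped back to a single original symbol.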
Why does this still matter?
Legacy datasets, supplementary files, and lab templates created before 2020 still contain the old symbols, and many will be opened in Excel in the future. Follow-up analyses and meta-analyses can inherit gene label errors unless teams validate inputs.
A 2021 follow-up study reported that spreadsheet gene name errors continued to appear in newly published articles despite increased awareness, underscoring the need for better practices and tooling (PLOS Computational Biology).
The 2020 renamings remove one major trigger, but the broader lesson is to treat spreadsheets as presentation tools, not analysis systems. Use stable IDs, enforce explicit data types on import, and keep a validation step in every pipeline that touches gene labels.
