Lehigh University Libraries - Library Guides

Citation Guides and Style Manuals: Data Citation

Data Citation

This page explains what to include in a data citation. It also covers copyright and licensing for data.

Copyright Policy for Data

In addition to the information below, see the LTS webpages about copyright.

What is copyright?
The U.S. Copyright Office defines copyright as “a form of protection grounded in the U.S. Constitution and granted by law for original works of authorship fixed in a tangible medium of expression. Copyright covers both published and unpublished works.” Copyright is therefore a form of intellectual property, alongside patents and trademarks.

What data is copyrightable?
In the U.S., facts are not copyrightable. For example, the fact that 1+1=2 is not copyrightable. Applying this principle to data, individual facts or data points are not copyrightable either. For example, the statement that the water temperature at a lake measured 10 degrees on March 20th is a fact and hence not copyrightable. What is copyrightable in the U.S. is a new or original way of selecting and arranging data. Take a phone book as an example: the information it contains, such as names, addresses and phone numbers, is not copyrightable; however, an original selection and arrangement of those entries is. For data, this means that the original selection and arrangement of data can be copyrighted. In addition, the associated metadata, that is, the documentation and descriptions of the data or of the process used to collect them, are also copyrightable.

Who can claim copyright?
In the U.S., copyright is typically assigned by default. Therefore, you hold the copyright to the data you produce even if you have not formally registered it. However, copyright is not necessarily assigned to the people who create the data; it may instead belong to the organizations for which they work. It is important to read and understand the copyright and intellectual property policy of Lehigh University so that you know when this might be the case for you.

(Adapted from Mayernik, M. 2012. “Responsible Data Use: Copyright and Data.” In Data Management for Scientists Short Course, edited by Ruth Duerr and Nancy J. Hoebelheinrich. Federation of Earth Science Information Partners: ESIP Commons. doi:10.7269/P31V5BWP.)

Data Citation

Why do we need data citation?
Datasets generated during research are as valuable as the papers that appear in scientific journals, and should be treated as citable sources on par with traditional materials. Data citation lets researchers link their academic publications to the underlying datasets, which helps keep those datasets available for access and reuse.

What does a data citation contain?

Author(s): Creator(s) of the dataset.
Publication date: Whichever is the later of: the date the dataset was made available, the date all quality assurance procedures were completed, and the date the embargo period expired.
Title: As well as the name of the cited resource itself, this may also include the name of a facility and the titles of the top collection and main parent sub-collection (if any) of which the dataset is a part.
Edition: The level or stage of processing of the data, indicating how raw or refined the dataset is.
Version: A number increased when the data changes, as the result of adding more data points or re-running a derivation process, for example.
Feature name and URI: The name of an ISO 19101:2002 'feature' (e.g. GridSeries, ProfileSeries) and the URI identifying its standard definition, used to pick out a subset of the data.
Resource type: Examples: 'database', 'dataset'.
Publisher: The organisation either hosting the data or performing quality assurance.
Unique numeric fingerprint (UNF): A cryptographic hash of the data, used to ensure no changes have occurred since the citation.
Identifier: An identifier for the data, according to a persistent scheme.
Location: A persistent URL from which the dataset is available. Some identifier schemes provide these via an identifier resolver service.
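
As an illustration only, the elements listed above can be assembled into a citation string. The Python sketch below loosely follows a DataCite-style ordering (creator, year, title, version, publisher, resource type, location). The class name, field names, and all example values are hypothetical and not part of any standard or library; your publisher's required style takes precedence.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    """A subset of the citation elements listed above (hypothetical structure)."""
    authors: List[str]
    year: int
    title: str
    publisher: str
    version: Optional[str] = None        # optional element
    resource_type: Optional[str] = None  # e.g. "Dataset" or "Database"
    location: Optional[str] = None       # persistent URL for the dataset

    def format(self) -> str:
        """Join the elements that are present into one citation string."""
        parts = [f"{'; '.join(self.authors)} ({self.year})", self.title]
        if self.version:
            parts.append(f"Version {self.version}")
        parts.append(self.publisher)
        if self.resource_type:
            parts.append(self.resource_type)
        text = ". ".join(parts) + "."
        if self.location:
            text += f" {self.location}"
        return text


# Hypothetical example values, invented purely for illustration.
example = DataCitation(
    authors=["Doe, J.", "Roe, R."],
    year=2020,
    title="Lake Water Temperature Measurements",
    publisher="Example Data Repository",
    version="1.2",
    resource_type="Dataset",
    location="https://doi.org/10.1234/example",
)
print(example.format())
# Doe, J.; Roe, R. (2020). Lake Water Temperature Measurements. Version 1.2. Example Data Repository. Dataset. https://doi.org/10.1234/example
```

Optional elements that are not supplied are simply left out, so the same sketch also works for datasets that have no version or resource type.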

What should researchers be aware of when citing a dataset? 

Although standardization and consistency in research data citation are still evolving, Ball and Duke (2012) of the Digital Curation Centre have summarized some widely accepted data citation practices for researchers to use:

  1. If you have generated/collected data to be used as evidence in an academic publication, you should deposit them with a suitable data archive or repository as soon as you are able. If they do not provide you with a persistent identifier or URL for your data, encourage them to do so.
  2. When citing a dataset in a paper, use the citation style required by the editor/publisher. If no citation style is suggested, take a standard data citation style (e.g. DataCite’s) and adapt it to match the style for textual publications.
  3. Give the dataset identifier in the form of a URL wherever possible, unless otherwise directed.
  4. Include data citations alongside those for textual publications. Some reference management packages now include support for datasets, which should make this easier.
  5. Cite datasets at the finest-grained level available that meets your need. If that is not fine enough, provide details of the subset of data you are using at the point in the text where you make the citation.
  6. If a dataset exists in several versions, be sure to cite the exact version you used.
  7. When you publish a paper that cites a dataset, notify the repository that holds the dataset, so it can add a link from that dataset to your paper.

(Adapted from Ball, A., and Duke, M. 2012. How to Cite Datasets and Link to Publications. In: A Digital Curation Centre 'working level' guide. Digital Curation Centre.)
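
Practice 3 above (giving the identifier as a URL) can be sketched in a few lines of Python for the common case where the identifier is a DOI. This is a minimal sketch, not an official tool: the function name `doi_to_url` is hypothetical, and it assumes the public resolver at https://doi.org/. The DOI used in the example is the one from the Mayernik reference on this page.

```python
from urllib.parse import quote

# Public DOI resolver (assumed here; other identifier schemes use other resolvers).
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare or 'doi:'-prefixed DOI into a resolver URL (hypothetical helper)."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # DOIs start with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in URLs, keeping "/" as is.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi: 10.7269/P31V5BWP"))
# https://doi.org/10.7269/P31V5BWP
```

Identifiers from other persistent schemes would need their own resolver address, so treat the constant above as an assumption specific to DOIs.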

Data Licenses

Why license research data?
A data license makes the terms of use clear, ensures that other parties understand what they are allowed to do with the data, and helps prevent infringement of the rights holder's rights when the data are reused.

What data licenses are available?

Creative Commons
General information:
• Simple yet robust licenses for creative works.
• Widely used for most forms of original content, including data.
Pros (good for):
• very simple, factual datasets
• data to be used automatically
Cons (watch out for):
• attribution stacking
• the NC (Non-Commercial) condition: only use with dual licensing
• the SA (Share Alike) condition, as it reduces interoperability
• the ND (No Derivatives) condition, as it severely restricts reuse

Public Domain
General information:
• The most permissive way of releasing data.
• All copyrights and database rights are waived, allowing the data to be used as freely as possible.
• Infringement becomes a non-issue.
Pros (good for):
• most databases and datasets
• data to be used by anyone or any tool
• data to be used for any purpose
Cons (watch out for):
• lack of control over how the database is reused
• lack of protection against unfair competition

Open Data Commons
General information:
• Similar to Creative Commons licenses, but designed specifically for databases.
Pros: Depending on the license type
Cons: Depending on the license type
License type: ODC-BY

Multiple Licensing
General information:
• Used when none of the above licenses is satisfactory.
• Usually employed in licensing open source software.

How do I select a data license?

  1. Researchers should first check the policy of funding agencies or universities to find out whether they are obliged or strongly encouraged to use a certain license. In addition, some data centres or repositories have licenses that depositors must grant as a condition of deposit.
  2. If neither the funding agency nor the university has a data licensing policy, researchers can review the available license options and choose the most appropriate one.

(Adapted from Ball, A. 2011. How to License Research Data. In: A Digital Curation Centre and JISC Legal 'working level' guide. Digital Curation Centre.)