This documentation is the result of a collaboration among the HathiTrust Research Center, JSTOR, and Portico to establish a common vocabulary for the exchange of extracted features datasets derived from a number of text object types.
Editors: Boris Capitanu, Timothy W. Cole, Jacob Jett, Deren Kudeki, Amy Kirchoff, Ted Lawless, Ron Synder, Dongqing Xie, J. Stephen Downie
The simplified data model (Figure 1 below) is designed to merge JSTOR and Portico’s fine-grained article database metadata with HathiTrust’s coarse-grained MARC records.
Figure 1: Simplified Extracted Features Data Model
The primary expectation is that the resulting Extracted Features (EF) datasets will be serialized as JSON-LD that conforms to the accompanying JSON-LD context document. Figures 2-6 (below) illustrate the overall document model for the EF JSON files.
Figure 2: JSON-LD Document Model Root/Header Section
Figure 3: JSON-LD Volume/Page Features Section
Figure 4: JSON-LD Header/Body/Footer Features Sections
Figure 5: JSON-LD Metadata Section (Attributes)
Figure 6: JSON-LD Metadata Section (Entity Relationships)
The following tables define the keys to be used in the JSON-LD EF serialization. Users should note that additional keys conforming to additional schema.org entity properties may also appear in some EF data files, especially those produced by JSTOR/Portico which have finer-grained metadata from which to draw information. Among these additional keys (not defined here) users may see: pageEnd, pageStart, and pagination.
Note: The primary purpose of this documentation is to explicate the specific semantics and intended use of each key in the schema. As such, no information is provided regarding the provenance of the data that appears as values for those keys.
Table 1: EF Common JSON Object Keys
Key Name |
Definition |
id |
This key is used to identify node objects within the JSON document. When present, its value MUST be null, an absolute URL, a relative URL, or a compact URL. In some cases this id will be synonymous with identifiers used in older schema versions, such as the value of the deprecated “volumeIdentifier” key. See JSON-LD spec (https://json-ld.org/spec/latest/json-ld/#node-objects) for additional information. |
type |
This key is used to assert that the node is a particular kind of entity. If the node object contains the @type key, its value MUST be either an absolute URL, a relative URL, a compact URL, a term defined in the active context expanding into an absolute URL, or an array of any of these. See section 3.4, Specifying the Type, for further discussion on @type values. See JSON-LD spec (https://json-ld.org/spec/latest/json-ld/#node-objects) for additional information. The following types are typically used in EF files for bibliographic entities (including the EF file and its features and metadata parts): Agents will typically have one of the following types: |
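The value constraints on id and type can be checked mechanically. The following is a minimal Python sketch, not part of the schema: the function name check_id_and_type is hypothetical, it assumes the keys appear under the aliased names id and type, and it only checks JSON value types (it does not verify that a string is a well-formed URL or a term defined in the context).

```python
def check_id_and_type(node):
    """Return a list of problems found in a node object's id/type keys."""
    problems = []

    # id: when present, must be null (None) or a URL, which JSON carries as a string.
    if "id" in node:
        if node["id"] is not None and not isinstance(node["id"], str):
            problems.append("id must be null or a URL string")

    # type: when present, must be a URL/term string or an array of such strings.
    if "type" in node:
        value = node["type"]
        candidates = value if isinstance(value, list) else [value]
        if not all(isinstance(item, str) for item in candidates):
            problems.append("type must be a string or an array of strings")

    return problems


# Hypothetical node objects used only to exercise the check.
good_node = {"id": "https://example.org/volume/1", "type": ["Book", "Dataset"]}
bad_node = {"id": 5, "type": None}
print(check_id_and_type(good_node))
print(check_id_and_type(bad_node))
```

A node that omits both keys passes the check, since the constraints only apply when a key is present.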
Table 2: Extracted Features File Header and Primary Objects
context |
This key is used to provide the context document that aliases all of the document’s various keys to their correct namespace references. When present, its value MUST be null, an absolute URL, a relative URL, or a compact URL. |
schemaVersion |
This key is used to convey version information for the EF JSON-LD document’s overall schema as well as those portions particular to the metadata and features objects. When present, its value MUST be null, an absolute URL, a relative URL, or a compact URL. |
creator |
This key records which institution generated the EF file. It is semantically synonymous with schema.org’s creator property (https://schema.org/creator). |
publisher |
This key is used to convey information about the agent who published a bibliographic entity. It is semantically synonymous with schema.org’s publisher property (https://schema.org/publisher). Note that here at the top level the publisher is the institution publishing the Extracted Features dataset (e.g., HathiTrust Research Center). |
name |
This key records a name label for an entity. It is semantically synonymous with schema.org’s name property (https://schema.org/name). Scope Note: This key is intended to be used for node objects representing agents (e.g., publisher above). For node objects that represent bibliographic entities, see title below. |
datePublished |
This key records the date on which the JSON-LD file was published. Note that this date may differ from the date on which the data was created. It is semantically synonymous with schema.org’s datePublished property (see: https://schema.org/datePublished). |
htid |
This key records an ambiguous name for the family of objects related to a particular “volume” in the HathiTrust context. It is reused in the identifiers that denote various versions of the “volume” (e.g., the OCR text files that compose it, a book on a shelf, etc.), metadata records describing the “volume”, and even this extracted features JSON-LD file. When present, this key’s value MUST be null or a string. |
metadata |
This key is used to demarcate the section of the EF file that contains metadata describing the entity from which the EF data was generated. When present, the value of this key MUST be null or a node object. Semantically, the metadata predicate is a sub-property of schema.org’s dataFeedElement property (https://schema.org/dataFeedElement). Scope Note: This key is used specifically for denoting the part of the data feed that communicates metadata about the bibliographic object that was analyzed to produce the features. |
features |
This key is used to indicate the section of the JSON-LD serialization that contains the actual EF data. When present, the value of this key MUST be null or a node object. Semantically, the features predicate is defined as a sub-property of schema.org’s dataFeedElement property (https://schema.org/dataFeedElement). Scope Note: This key is used specifically for denoting the part of the dataset that communicates the data that was generated by analysis of the bibliographic object described by the metadata block. |
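To make the header keys above concrete, the following Python sketch builds a skeleton EF document and reads back a few top-level values. It is an illustration under stated assumptions, not an excerpt from a real EF file: every URL, identifier, name, and date is a placeholder, the helper summarize_header is hypothetical, and the context key is written as @context because JSON-LD does not allow that keyword to be aliased.

```python
import json

# Placeholder document: all values below are invented for illustration.
ef_document = {
    "@context": "https://example.org/ef_context.jsonld",
    "schemaVersion": "https://example.org/schemas/EF",
    "id": "https://example.org/extracted-features/vol.0001",
    "publisher": {"id": "https://example.org/org", "name": "Example Research Center"},
    "datePublished": "2020-01-01",
    "htid": "vol.0001",
    "metadata": {"id": "https://example.org/metadata/vol.0001", "title": "An Example Volume"},
    "features": {"id": "https://example.org/features/vol.0001", "pageCount": 0, "pages": []},
}


def summarize_header(doc):
    """Collect a few top-level values, tolerating absent or null keys."""
    publisher = doc.get("publisher")
    # The publisher may be a node object with a name, or absent/null.
    name = publisher.get("name") if isinstance(publisher, dict) else publisher
    return {
        "htid": doc.get("htid"),
        "publisher": name,
        "has_metadata": isinstance(doc.get("metadata"), dict),
        "has_features": isinstance(doc.get("features"), dict),
    }


# Round-trip through JSON text to mimic loading a file from disk.
loaded = json.loads(json.dumps(ef_document))
summary = summarize_header(loaded)
print(summary)
```

Because metadata and features may each be null, the sketch tests for a node object explicitly rather than assuming the sections exist.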
schemaVersion |
This key is used to convey version information for the features-specific portion of the EF schema. When present, its value MUST be null, an absolute URL, a relative URL, or a compact URL. |
dateCreated |
This key records the date information regarding when the features data in this section was created. It is semantically synonymous with schema.org’s dateCreated property (see: https://schema.org/dateCreated). |
pageCount |
This key conveys the number of pages in the “volume”. When present, the value of this key MUST be either null or an integer. |
pages |
This key is used to indicate the section of the JSON-LD serialization that contains the page-level EF data. When present, the value of this key MUST be null or an array. Semantically, the pages predicate is defined as a sub-property of schema.org’s hasPart property (https://schema.org/hasPart). Scope Note: This key is used specifically for denoting the part of the dataset that communicates the page-level data. For other meronymic relations see isPartOf and hasPart below. |
seq |
This key conveys the relative position of a page within a “volume”. When present, the value of this key MUST be null or an integer. |
version |
This key provides an MD5 hash that identifies the version of the analyzed page. When present, the value of this key MUST be either a string or null. |
calculatedLanguage |
This key provides the algorithmically estimated language (the one with the highest probability value) of the text on the page. When present, the value of this key MUST be either a string or null. |
tokenCount |
The total number of tokens in the section (e.g., page, header, body, or footer) indicated by the node object. When present, the value of this key MUST be null or an integer. |
lineCount |
The total number of non-empty lines in the section (e.g., page, header, body, or footer) indicated by the node object. When present, the value of this key MUST be null or an integer. |
emptyLineCount |
The total number of empty lines in the section (e.g., page, header, body, or footer) indicated by the node object. When present, the value of this key MUST be null or an integer. |
sentenceCount |
The total number of sentences in the section (e.g., page, header, body, or footer) indicated by the node object. When present, the value of this key MUST be null or an integer. |
header |
This key is used to indicate the section of the JSON-LD serialization that contains the EF data for the page header. When present, the value of this key MUST be null or a node object. Semantically, the header predicate is defined as a sub-property of schema.org’s hasPart property (https://schema.org/hasPart). Scope Note: This key is used specifically for denoting the part of the dataset that communicates the header data. For other meronymic relations see isPartOf and hasPart below. |
body |
This key is used to indicate the section of the JSON-LD serialization that contains the EF data for the page body. When present, the value of this key MUST be null or a node object. Semantically, the body predicate is defined as a sub-property of schema.org’s hasPart property (https://schema.org/hasPart). Scope Note: This key is used specifically for denoting the part of the dataset that communicates the body data. For other meronymic relations see isPartOf and hasPart below. |
footer |
This key is used to indicate the section of the JSON-LD serialization that contains the EF data for the page footer. When present, the value of this key MUST be null or a node object. Semantically, the footer predicate is defined as a sub-property of schema.org’s hasPart property (https://schema.org/hasPart). Scope Note: This key is used specifically for denoting the part of the dataset that communicates the footer data. For other meronymic relations see isPartOf and hasPart below. |
tokenPosCount |
This key conveys the tokens that appear in a section. When present, the value of this key MUST be either null or an array. Scope Note: The value is an unordered list of all tokens (characterized by part of speech using OpenNLP) and their corresponding frequency counts in this page section. Tokens are case-sensitive, so a capitalized "Rose" is counted as a separate token from "rose". Counts are also kept separately per part of speech, for instance for "rose" (noun) and "rose" (verb). Words separated by a hyphen across a line break are rejoined. No other data cleaning or OCR correction was performed. |
beginCharCount |
This key conveys the aggregated frequency counts of the first non-whitespace character on each line. When present, this key’s value MUST either be null or an array. |
endCharCount |
This key conveys the aggregated frequency counts of the last non-whitespace character on each line. When present, this key’s value MUST either be null or an array. |
capAlphaSeq |
This key conveys the longest length of the alphabetical sequence of capital characters starting a line. When present, this key’s value MUST either be null or an integer. Scope Note: This key only appears in node objects that are the value of body keys. |
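The count keys above can be aggregated across a volume's pages. The following Python sketch sums one count key per page section; it is illustrative only. The function name section_totals and all sample values are hypothetical, and the sketch assumes that a null section or a null count contributes zero to the total.

```python
def section_totals(features, count_key="tokenCount"):
    """Sum one count key over the header, body, and footer of every page."""
    totals = {"header": 0, "body": 0, "footer": 0}
    # pages may be null; treat that the same as an empty array.
    for page in features.get("pages") or []:
        for section_name in totals:
            # A section may be null or absent; fall back to an empty node.
            section = page.get(section_name) or {}
            # A count may be null; treat it as zero (an assumption of this sketch).
            totals[section_name] += section.get(count_key) or 0
    return totals


# Invented sample data following the key names defined above.
sample_features = {
    "pageCount": 2,
    "pages": [
        {
            "seq": 1,
            "header": {"tokenCount": 2, "lineCount": 1},
            "body": {"tokenCount": 10, "lineCount": 3},
            "footer": None,
        },
        {
            "seq": 2,
            "header": None,
            "body": {"tokenCount": 6, "lineCount": None},
            "footer": {"tokenCount": 1, "lineCount": 1},
        },
    ],
}

print(section_totals(sample_features))
print(section_totals(sample_features, count_key="lineCount"))
```

The same function works for lineCount, emptyLineCount, and sentenceCount, since all four keys share the "null or an integer" constraint.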
schemaVersion |
This key is used to convey version information for the metadata-specific portion of the EF schema. When present, its value MUST be null, an absolute URL, a relative URL, or a compact URL. |
dateCreated |
This key records the date information regarding when metadata in this section was created. Note that this date is not reflective of the metadata’s original creation date, just the most recent date a new version (the version used here) of the original metadata was created. It is semantically synonymous with schema.org’s dateCreated property (see: https://schema.org/dateCreated). |
title |
This key is used to convey the title of the entity the node object represents. Semantically, the title predicate is defined as a sub-property of schema.org’s name property (https://schema.org/name). Scope Note: This key is intended to be used for node objects representing bibliographic entities. For node objects that represent agents, see name above. |
journalTitle |
This key is used to convey the title of the entity that a Journal node object represents. This predicate is defined as a sub-property of the title predicate (above). Scope Note: This key is used for node objects representing entities that possess the schema:Journal class. |
issueTitle |
This key is used to convey the title of the entity that a Publication Issue node object represents. This predicate is defined as a sub-property of the title predicate (above). Scope Note: This key is used for node objects representing entities that possess the schema:PublicationIssue class. |
alternateTitle |
This key is used to convey an additional title of the entity the node object represents. This predicate is defined as a sub-property of the title predicate (above). Scope Note: This key is intended to be used for node objects representing bibliographic entities. For node objects that represent agents, see name above. It may be used in conjunction with the journalTitle and issueTitle keys in addition to the title key. This key may have an array of strings as its value. |
enumerationChronology |
This key is used to convey information regarding which volume, issue, or annum a creative work is. When present, the value of this key MUST be null or a string. Typically, the string value is derived from a part of the title string. |
issueNumber |
This key is used to convey enumeration information regarding a publication issue. When present, the value of this key must be null, an integer, or a string. It is semantically synonymous with schema.org’s issueNumber property (https://schema.org/issueNumber). |
volumeNumber |
This key is used to convey enumeration information regarding a publication volume. When present, the value of this key must be null, an integer, or a string. It is semantically synonymous with schema.org’s volumeNumber property (https://schema.org/volumeNumber). |
publisher |
This key is used to convey information about the agent who published a bibliographic entity. It is semantically synonymous with schema.org’s publisher property (https://schema.org/publisher). |
pubPlace |
This key is used to convey information regarding where a bibliographic entity was first published. It is semantically synonymous with schema.org’s location property (https://schema.org/location). |
pubDate |
This key is used to convey information regarding when a bibliographic entity was first published. It is semantically synonymous with schema.org’s datePublished property (https://schema.org/datePublished). |
genre |
This key is used to convey information regarding the genre of a bibliographic entity. It is semantically synonymous with schema.org’s genre property (https://schema.org/genre). |
category |
This key is used to convey information regarding the overall topic of a bibliographic entity. When present, its value MUST be null, a string value, or an array of string values. It is semantically synonymous with schema.org’s about property (https://schema.org/about). Scope Note: The strings in the category conform to an established standard mapping from Library of Congress Classification numbers (LCC) to a taxonomy [using string data] describing broad topical categories maintained by the University of Michigan (https://www.lib.umich.edu/browse/categories/). |
subjects |
This key is reserved for future use to convey topical information regarding the bibliographic entity. |
language |
This key is used to convey information regarding the primary language of a bibliographic entity. It is semantically synonymous with schema.org’s inLanguage property (https://schema.org/inLanguage). Scope Note: This key is also used in the features section to annotate the primary language of each analyzed page. |
accessRights |
This key is used to convey the copyright status of the bibliographic entity from which the EF were created. When present this key’s value must either be null or a string. Scope Note: When the EF file’s source is the HTRC, string values for this key will conform to the attributes described by the HathiTrust’s rights database (https://www.hathitrust.org/rights_database#DatabaseLayout). EF files generated by JSTOR/Portico also conform to these attributes but typically only use the “pd”, “ic”, and “und” values. |
isAccessibleForFree |
This key is a coarser indicator than accessRights (above) of whether something is publicly available. It is semantically synonymous with schema.org’s isAccessibleForFree property (https://schema.org/isAccessibleForFree). Scope Note: End users should note that the value of this key may not match the actual accessibility state, as copyright reassessments of the works in the dataset are ongoing. See related lastRightsUpdateDate key below. |
lastRightsUpdateDate |
This key is used to convey the most recent date on which the access rights for a bibliographic entity were updated. It is semantically synonymous with schema.org’s dateModified property (https://schema.org/dateModified). |
contributor |
This key replaces the older names key (see list of deprecated keys below). It is semantically synonymous with schema.org’s contributor property (https://schema.org/contributor), except that it may additionally have an array as its value. Scope Note: As JSTOR/Portico have richer datasets from which to draw information, narrower keys indicating richer role information, including author, editor, etc., MAY appear in some EF files. These keys are semantically synonymous with analogous properties defined by the schema.org standard. See: |
typeOfResource |
This key is used to indicate the coarse-grained format of a bibliographic entity, e.g., text, image, etc. When present, this key’s value must either be null or a string. |
sourceInstitution |
This key is used to indicate which institution provided a bibliographic entity. It is semantically synonymous with schema.org’s provider property (https://schema.org/provider). |
isPartOf |
This key is used to convey information regarding an entity’s meronymic relation to another entity. When present, the value of this key MUST be null or a node object. It is semantically synonymous with schema.org’s isPartOf property (https://schema.org/isPartOf). Scope Note: This key is expected to be used in JSTOR and Portico EF files where article-issue-volume-series chains are better supported by the data. In this manner JSTOR and Portico will be able to provide additional relevant metadata at each level of work granularity. |
hasPart |
This key is used to convey information regarding an entity’s meronymic relation to another entity. When present, the value of this key MUST be null or a node object. It is semantically synonymous with schema.org’s hasPart property (https://schema.org/hasPart). Scope Note: This key is expected to be used in JSTOR and Portico EF files where article-issue-volume-series chains are better supported by the data. In this manner JSTOR and Portico will be able to provide additional relevant metadata at each level of work granularity. Note that the pages, header, body, and footer keys (described above) are also defined as sub-properties of schema.org’s hasPart property. |
mainEntityOfPage |
This key is used to indicate additional sources of metadata that describe a bibliographic entity, e.g., through links to catalog records. It is semantically synonymous with schema.org’s mainEntityOfPage property (https://schema.org/mainEntityOfPage) except that it may additionally have an array as its value. |
issn |
This key is used to convey a creative work series’ ISSN number. It is semantically synonymous with schema.org’s issn property (https://schema.org/issn). |
isbn |
This key is used to convey a book’s ISBN number. It is semantically synonymous with schema.org’s isbn property (https://schema.org/isbn). |
sourceInstitutionRecordNumber |
This key is used to convey a media object’s institution-specific identifier. This predicate is defined as a sub-property of schema.org’s identifier property (https://schema.org/identifier). |
oclc |
This key is used to convey the identifier assigned to the media object by OCLC. This predicate is defined as a sub-property of schema.org’s identifier property (https://schema.org/identifier). |
lccn |
This key is used to convey the Library of Congress Control Number assigned to the media object. This predicate is defined as a sub-property of schema.org’s identifier property (https://schema.org/identifier). |
classification |
This key is used to convey all other classification identifiers assigned to the media object. This predicate is defined as a sub-property of schema.org’s identifier property (https://schema.org/identifier). |
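The isPartOf key lets a consumer walk from an article up through its containing issue and journal. The following Python sketch collects a (type, title) pair at each level. It is a sketch under assumptions: the function name part_of_chain is hypothetical, the sample record and its type terms (Article, PublicationIssue, Journal) are invented for illustration, and the sketch assumes each level carries at most one of title, journalTitle, or issueTitle.

```python
TITLE_KEYS = ("title", "journalTitle", "issueTitle")


def part_of_chain(node):
    """Follow isPartOf links upward, returning (type, title) for each level."""
    chain = []
    # isPartOf is null or a node object, so stop at anything that is not a dict.
    while isinstance(node, dict):
        # Use the first title-like key that holds a non-null value.
        label = next((node[key] for key in TITLE_KEYS if node.get(key) is not None), None)
        chain.append((node.get("type"), label))
        node = node.get("isPartOf")
    return chain


# Invented metadata record; identifiers and titles are placeholders.
sample_metadata = {
    "type": "Article",
    "title": "An Example Article",
    "isPartOf": {
        "type": "PublicationIssue",
        "issueTitle": "Spring Issue",
        "issueNumber": "2",
        "isPartOf": {
            "type": "Journal",
            "journalTitle": "Journal of Examples",
            "isPartOf": None,
        },
    },
}

for level_type, level_title in part_of_chain(sample_metadata):
    print(level_type, "-", level_title)
```

Records without an isPartOf key, which this documentation suggests is the common case outside JSTOR and Portico files, simply yield a one-element chain.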
Table 5: List of Deprecated Keys
Key |
Reason for Deprecation |
names |
Superseded by more informative keys (e.g., contributor, author, etc.). |
imprint |
Superseded by publisher, pubPlace, and pubDate |
hathiTrustRecordNumber |
Superseded by htid |
htBibUrl |
Integrated into mainEntityOfPage array. |
handleURL |
Used as node ID for features and metadata node objects |
lastUpdateDate |
Renamed to lastRightsUpdateDate to better reflect purpose. |
languages |
Renamed to calculatedLanguage to better reflect purpose. |
governmentDocument |
Superseded by more accurate information appearing in genre. |
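When working with files produced under older schema versions, it can help to flag deprecated keys before processing. The following Python sketch scans a parsed document for the keys in Table 5 and reports where each occurs, along with the replacement named in the table. The function name find_deprecated and the sample input are hypothetical, and the sketch only reports keys; it does not attempt to convert old values to the new structure.

```python
# Deprecated key -> replacement, transcribed from Table 5.
DEPRECATED_KEYS = {
    "names": "contributor, author, etc.",
    "imprint": "publisher, pubPlace, and pubDate",
    "hathiTrustRecordNumber": "htid",
    "htBibUrl": "mainEntityOfPage",
    "handleURL": "id of the features and metadata node objects",
    "lastUpdateDate": "lastRightsUpdateDate",
    "languages": "calculatedLanguage",
    "governmentDocument": "genre",
}


def find_deprecated(value, path=""):
    """Recursively list (path, replacement) for each deprecated key found."""
    hits = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            if key in DEPRECATED_KEYS:
                hits.append((child_path, DEPRECATED_KEYS[key]))
            hits.extend(find_deprecated(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            hits.extend(find_deprecated(child, f"{path}[{index}]"))
    return hits


# Invented fragment using two deprecated keys.
old_document = {
    "metadata": {"names": ["Example, Author"], "title": "An Example Volume"},
    "features": {"pages": [{"seq": 1, "languages": []}]},
}

for location, replacement in find_deprecated(old_document):
    print(f"{location}: use {replacement}")
```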