International drug vocabulary implementation process

The most straightforward way to standardize a local drug vocabulary is to use RxNorm Extension logic to extend the pool of standard Drug concepts and to map source drug products to concepts in RxNorm and RxNorm Extension.

Combined target structure

Standard concepts in the Drug domain are placed in a single comprehensive hierarchy based on their attributes. To correctly implement a vocabulary in the CDM and find or build a counterpart for each source drug concept, the following attributes must be extracted:

  1. Ingredients: active substance(-s) in pharmacological preparation. Examples: Aspirin, Trastuzumab, Ibuprofen etc.;
  2. Dose form: form in which drug is administered to patient. Examples: Oral Tablet, Transdermal System, Injectable Suspension;
  3. Marketed/Brand Name: Nurofen, Kadcyla, Prevenar 13 etc.;
  4. Dosage/strength of active substance, total volume of liquid drug or number of actuations for dosed inhalers or applicators;
  5. Information about combination pack contents, e.g. contraceptives calendar packs with varying dosages and ingredient combinations;
  6. Box size: number of unique package units (like bottles, blisters or syringes) in a packaged box;
  7. Manufacturer/supplier name (Reckitt Benckiser, TEVA, Ratiopharm etc.). Generally, a local branch and the general name of an international entity are considered the same (“Teva Europe” is the same as “Teva”).

Such attributes may be given explicitly in a well-structured source vocabulary or may have to be extracted from drug product names. If drug products or discrete attributes like Ingredients or Brand Names do not have their own codes, new codes have to be constructed by combining the word “OMOP” and a running number. The running number should be unique across all vocabularies. That means that each time a new vocabulary is added or refreshed, the next Concept Code should take the number of the last code issued (without the 'OMOP' string) plus 1.
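
The numbering rule above can be sketched as follows. This is only an illustration: the function name next_omop_codes is hypothetical, and the code format (the string 'OMOP' immediately followed by the number) is an assumption, since the text only says the two parts are combined.

```python
import re

# Assumed code shape: "OMOP" immediately followed by digits (not stated in the text).
OMOP_CODE = re.compile(r"^OMOP(\d+)$")


def next_omop_codes(existing_codes, how_many):
    """Return `how_many` new codes that continue after the highest OMOP number.

    `existing_codes` should hold the concept codes of ALL vocabularies,
    because the running number has to be unique across vocabularies.
    """
    numbers = []
    for code in existing_codes:
        match = OMOP_CODE.match(code)
        if match:  # codes supplied by the source vocabulary are skipped
            numbers.append(int(match.group(1)))
    last = max(numbers, default=0)
    return [f"OMOP{last + offset}" for offset in range(1, how_many + 1)]


# Two attributes without source codes need new codes.
new_codes = next_omop_codes(["OMOP100", "OMOP102", "A10BA02"], 2)
```

Taking the maximum rather than counting the existing codes keeps the sequence safe when earlier numbers have gaps.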

All source vocabulary concepts, extracted attributes, dosage and packaging information must be staged in the standardized format of the input tables, which can then be processed for inclusion into the OHDSI Standardized Vocabularies.

Challenges and problems

To implement a tool to create and maintain the above structure, a number of issues need to be taken care of:

  • Excipients: There is no general agreement on what is an active ingredient and what is an excipient. Therefore, some of the ingredients need to be declared as “semiactive”, such as gelatine. Generally, if an ingredient can be biologically active, but is not present in the preparation for its pharmaceutical properties, it should be considered an excipient. Excipients should be excluded from the list of a drug's ingredients.
  • Forms: These are not used the same way across drug vocabularies. For example, RxNorm has a Form “Cream”, but also “Ophthalmic cream”, “Vaginal cream”, “Rectal cream”, “Oral cream” and “Cutaneous cream”, making this Form ambiguous. Instead of a one-to-one mapping, a one to many mapping with an order of precedence is required to establish matching equivalence between Forms.
  • Strength: RxNorm normalizes weight units to “mg” and volume units to mL, but other vocabularies might not. There might be units like “µg”, “gram-%” or “volume-%”. Special unit conversion tables are needed instead of simple unit mappings. This approach becomes infeasible if units are used where the conversion is dependent on the molecule, like “mol” or “equivalent”.
  • Ingredient forms: Ingredients might have ambiguous chemical forms, which RxNorm calls “Ingredient” and “Precise Ingredient” (e.g. a salt of the active compound). They have to be mapped to the right Standard RxNorm Ingredient. If there is no RxNorm Ingredient to map to and the drug vocabulary to be added contains several ambiguous forms of the same Ingredient, one of them has to be declared Standard. In rare cases there might be several Standard duplicates of the same Ingredient. In those cases mappings from source vocabularies must be made with precedence. Another problem occurs when the strength is given for a precise ingredient rather than for a standard ingredient. An ingredient that is presented as an aqueous or spirit extract should be considered the same ingredient.

Creation of input tables

The new vocabulary should be prepared in the following input tables. Hereon, the DDL is given in PostgreSQL dialect. For other dialects, an appropriate data type has to be found. For example, for PostgreSQL we specifically set numbers to be NUMERIC and not FLOAT.

Field Required Type Description
concept_name Yes VARCHAR(255) An unambiguous, meaningful and descriptive name for the Concept in English language
domain_id Yes VARCHAR(20) A foreign key to the DOMAIN table. The standard content is 'Drug', but for non-drugs it could be 'Device' or 'Observation'
vocabulary_id Yes VARCHAR(20) A foreign key to the VOCABULARY table. The value of this field should be identical for all records, indicating the new vocabulary being added.
concept_class_id Yes VARCHAR(20) One of the above listed RxNorm Concept Classes
concept_code Yes VARCHAR(50) The code in the source vocabulary. If the source vocabulary does not contain a code, e.g. for ingredients or dose forms, they will be created automatically (see below OMOP created codes)
source_concept_class_id No VARCHAR(20) Concept class that is given by the source vocabulary
possible_excipient No VARCHAR(1) A flag only relevant to ingredients, indicating that they are not active ingredients and could be omitted from an ingredient list. Currently ignored.
valid_start_date No DATE Date when the Concept became valid. This may or may not coincide with the date the product went to market. Default value is 01.01.1970, unless source gives explicit date.
valid_end_date No DATE Date when the Concept became invalid. Market withdrawal does not mean a Concept is invalid. Deprecated concepts have VALID_END_DATE of a day before update, unless source gives explicit date. VALID_END_DATE for all valid source concepts must be 31.12.2099
invalid_reason No VARCHAR(1) Flag indicating whether the Concept is active (today's date between valid_start_date and valid_end_date), upgraded ('U') or deprecated ('D').

This table, DRUG_CONCEPT_STAGE, is expected to contain concepts of the following Concept Classes:

  • Drug Product (Branded Drug, Clinical Drug, Marketed Product etc.)
  • Form
  • Brand Name
  • Ingredient
  • Supplier
  • Device (for source concepts falling outside of the Drug category)

It may contain Branded or Clinical Drug Forms or Components, but if it does not, they will be derived (see below). Note that units do not necessarily need an entry in the DRUG_CONCEPT_STAGE table. Instead, they should be used verbatim. If the precise Concept Class of a Drug Product is relevant, it can be preserved in the source_concept_class_id field.

Brand Names that are simple combinations of generic international name of active substance and manufacturer name (e.g. “Aspirin Bayer”) should not appear as attributes for Drug Products. Manufacturer information should be stored as a concept with Supplier class.

Concepts that belong to the source vocabulary, but do not belong to Drug domain by OMOP CDM conventions, should be classified as 'Device'. Typically, these belong to different groups - check the list here.

Animal drugs can be handled as Drugs or Devices, depending on what their role in patient data can be expected to be. Note that only concepts from Drug domain can have attributes.
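
The field lengths and default dates of DRUG_CONCEPT_STAGE described above can be checked before loading. The following is a minimal sketch, not part of the vocabulary tooling: the class name DrugConceptStageRow, the vocabulary name 'MyVocab' and the example code are hypothetical, and only the length limits and default dates are taken from the field table.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# VARCHAR limits of the required text fields, as listed in the field table.
MAX_LENGTH = {
    "concept_name": 255,
    "domain_id": 20,
    "vocabulary_id": 20,
    "concept_class_id": 20,
    "concept_code": 50,
}


@dataclass
class DrugConceptStageRow:
    """One DRUG_CONCEPT_STAGE record (hypothetical helper for illustration)."""

    concept_name: str
    vocabulary_id: str
    concept_class_id: str
    concept_code: str
    domain_id: str = "Drug"  # 'Device' or 'Observation' for non-drugs
    source_concept_class_id: Optional[str] = None
    possible_excipient: Optional[str] = None
    valid_start_date: date = date(1970, 1, 1)  # default unless the source gives a date
    valid_end_date: date = date(2099, 12, 31)  # required for all valid concepts
    invalid_reason: Optional[str] = None

    def __post_init__(self):
        # Reject values that would not fit the column definitions.
        for field_name, limit in MAX_LENGTH.items():
            value = getattr(self, field_name)
            if len(value) > limit:
                raise ValueError(f"{field_name} exceeds {limit} characters")


row = DrugConceptStageRow(
    concept_name="Ibuprofen 200 MG Oral Tablet",
    vocabulary_id="MyVocab",  # hypothetical vocabulary being added
    concept_class_id="Drug Product",
    concept_code="12345",  # hypothetical source code
)
```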


The RELATIONSHIP_TO_CONCEPT table should contain the mapping between source codes and Standard Concepts for Ingredients, Brand Names, Dose Forms, Suppliers and Units. All other relationships will be ignored.

Field Required Type Description
concept_code_1 Yes VARCHAR(50) The source code
vocabulary_id_1 Yes VARCHAR(20) The source vocabulary
concept_id_2 Yes INTEGER The existing target Concept
precedence No SMALLINT For multiple concept_code_1/concept_id_2 combinations, the order of precedence in which they should be considered for equivalence testing. The mapping with the highest prevalence among the drugs will be used for writing a record to the CONCEPT_RELATIONSHIP table. A missing precedence will be interpreted as precedence 1. Every precedence value should be unique per concept_code_1.
conversion_factor No NUMERIC The factor used to convert the source code to the target Concept. This is usually defined for units

This table should contain all mappings from the new to existing Concepts and their precedence.

Units should be mapped to Standard Concept Units. Weight units should be converted to milligram, volume units to milliliter and molar units to millimole, each with the right conversion factor. The source code field (concept_code_1) should contain the verbatim string of the unit. It is highly desirable to only use units that are in use by Standard native RxNorm concepts. Query the DRUG_STRENGTH table for a distinct list.
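
As an illustration of how a unit mapping with a conversion factor is applied, here is a minimal sketch. The names UNIT_MAP and to_standard_unit are hypothetical, and the target is written as a unit string for readability, whereas the real table stores the Standard Concept in concept_id_2. Decimal is used to mirror the NUMERIC column type.

```python
from decimal import Decimal

# verbatim source unit -> (standard unit, conversion_factor)
# Illustrative entries only; a real table would hold one row per verbatim unit.
UNIT_MAP = {
    "mg": ("mg", Decimal("1")),
    "g": ("mg", Decimal("1000")),
    "mcg": ("mg", Decimal("0.001")),
    "ml": ("mL", Decimal("1")),
    "l": ("mL", Decimal("1000")),
}


def to_standard_unit(value, unit):
    """Convert a source value and its verbatim unit to the standard unit.

    Raises KeyError for units without a mapping, for example 'mol',
    where the conversion depends on the molecule.
    """
    target_unit, factor = UNIT_MAP[unit]
    return Decimal(str(value)) * factor, target_unit


amount, unit = to_standard_unit("0.5", "g")  # half a gram expressed in mg
```

The lookup is deliberately exact, because the source code field holds the unit string verbatim.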

Ingredients are usually mapped to Standard concepts one to one. If ingredient is given as a mix (e.g. Co-dried gel of Magnesium Carbonate and Aluminium Hydroxide), it should be split in multiple entities with distinct new codes; each component of the mix must be mapped to standard ingredient.

One to many mappings with precedence should be used if:

  • Source ingredient is an ion (like calcium, iron, zinc, etc.), which should be mapped to all its salts;
  • Source ingredient is a herbal extract, which should be mapped to all suitable standard concepts;
  • Target vocabularies contain logical duplicates among standard ingredients. This is rare. Example: RxNorm contains both 19026739 Pantothenic Acid and 19088079 pantothenate as separate standard ingredients (as of July 2019).

Dose Forms are commonly mapped to multiple RxNorm dose forms with precedence. Modified release forms should be first mapped to corresponding forms in RxNorm vocabulary (like Delayed Release Oral Capsule), and then to more generic forms (Oral Capsule) with lower precedence.
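
A one-to-many mapping with precedence could be staged as in the sketch below. The function name precedence_rows, the source code 'F1', the vocabulary name 'MyVocab' and the target concept IDs 1001 and 1002 are all hypothetical placeholders, not real concept IDs.

```python
def precedence_rows(concept_code, vocabulary_id, ordered_targets):
    """Build RELATIONSHIP_TO_CONCEPT rows for one source attribute.

    `ordered_targets` lists target concept IDs from most to least specific;
    repeated targets are dropped so each precedence value is used once.
    """
    rows = []
    seen = set()
    for concept_id in ordered_targets:
        if concept_id in seen:
            continue
        seen.add(concept_id)
        rows.append(
            {
                "concept_code_1": concept_code,
                "vocabulary_id_1": vocabulary_id,
                "concept_id_2": concept_id,
                "precedence": len(rows) + 1,  # 1 = tried first
            }
        )
    return rows


# Hypothetical IDs: 1001 stands for 'Delayed Release Oral Capsule',
# 1002 for the more generic 'Oral Capsule'.
form_rows = precedence_rows("F1", "MyVocab", [1001, 1002])
```

Deriving the precedence from the list position keeps the values unique per source code, which is what the input table requires.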

Field Required Type Description
concept_code_1 Yes VARCHAR(50) One source code of the pair
concept_code_2 Yes VARCHAR(50) The other source code of the pair

The INTERNAL_RELATIONSHIP_STAGE table should contain relationships for each Drug Concept: to the Ingredients (always), the Dose Form (if appropriate), the Supplier (if appropriate) and the Brand Name (if appropriate). All other relationships will be derived, and ignored if they exist in the table. The relationships don't need to be symmetrical; only the one initiating from the Drug Concept is required.

If a Drug Product concept does not have an Ingredient attribute, it will be non-standard (as all source concepts are) and will not have any standard mapping target after processing. The Supplier attribute will not be considered for concepts without a DS_STAGE or PC_STAGE entry, since Marketed Product concepts can't exist without dosage information.

Field Required Type Description
drug_concept_code Yes VARCHAR(50) The source code of the Drug or Drug Component, either Branded or Clinical
ingredient_concept_code Yes VARCHAR(50) The source code for one of the Ingredients
amount_value No NUMERIC The numeric value for absolute content (usually solid formulations)
amount_unit No VARCHAR(255) The verbatim unit of the absolute content (solids)
numerator_value No NUMERIC The numerator value for a concentration (usually liquid formulations)
numerator_unit No VARCHAR(255) The verbatim numerator unit of a concentration (usually liquid formulations)
denominator_value No NUMERIC The denominator value for a concentration (usually liquid formulations). It should contain a number for Quantified products, and null for everything else.
denominator_unit No VARCHAR(255) The verbatim denominator unit of a concentration (usually liquid formulations)
box_size No SMALLINT The number of units per box

The DS_STAGE table contains the dose of each ingredient in each drug, as well as the box_size. For drugs that have no strength information, or have it only for some of their ingredients, the DS_STAGE records must be omitted. '0' values in DS_STAGE are only allowed for inert drugs. Drug ingredients should match those in INTERNAL_RELATIONSHIP_STAGE. If several ingredients are mapped to the same concept in RELATIONSHIP_TO_CONCEPT, their dosages should be summed up as for a single ingredient before processing. A drug should not contain ingredients in both solid (amount) and liquid (numerator/denominator) form. This might be caused by either a source data aberration or a drug pack, which must be split into separate Drug Products and processed in the PC_STAGE table. If a denominator value is given, a quantified drug will be created with the given denominator value and unit as total volume.

  • Inhalers, enemas or sprays that release certain dosage of active ingredient per activation should also be stored in numerator/denominator form with total number of actuations as denominator (e.g. X MG / Y ACTUAT).
  • All drugs with fixed amount must have dosage in amount fields and all solutions must have dosage filled in numerator and denominator fields. When liquid drugs in data contain concentration information without volume, DENOMINATOR_VALUE field is left empty.
  • Gases for inhalation must be put in numerator fields with % in unit field without filling denominator fields. It’s the only acceptable use of percents in DS_STAGE. Make sure to convert everything else to mg/ml or mg/mg.
  • Patches, drug implants and other forms that release molecules over a period of time (even extended release tablets or capsules) may also be stored in numerator/denominator form (e.g. X MG / Y HOUR).
  • If dose form for the source concept is given as a soluble powder without a solvent (except powder inhalers), dosage is stored in amount field.
  • For drugs that are administered in a form of oral liquid (solution, suspension, syrup), denominator value of 5 ML should be kept only when we are certain that the dosage is not given “per tbsp.”; if 5 ML is not an actual fixed administered dose (e.g. a sachet or vial), it should be treated as concentration (DENOMINATOR_VALUE = NULL).
  • Box size equal to 1 should be simply stored as NULL.
  • Drugs can't have differing denominator information among different ingredients, skip dosage for some ingredients, or have the same ingredient with different dosages.
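
Several of the DS_STAGE rules above lend themselves to an automated check per drug. The following is a minimal sketch under stated assumptions: the function name check_drug and the row dictionaries are hypothetical, and the check covers only a handful of the rules, not the full processing logic.

```python
def check_drug(rows):
    """Return a list of problems for all DS_STAGE rows of one drug_concept_code."""
    problems = []
    solid = [r for r in rows if r.get("amount_value") is not None]
    liquid = [r for r in rows if r.get("numerator_value") is not None]
    missing = [
        r for r in rows
        if r.get("amount_value") is None and r.get("numerator_value") is None
    ]
    if solid and liquid:
        problems.append("mixes amount and numerator/denominator dosages")
    if missing:
        problems.append("dosage missing for some ingredients")
    # All ingredients of a drug must share the same denominator information.
    denominators = {
        (r.get("denominator_value"), r.get("denominator_unit")) for r in rows
    }
    if len(denominators) > 1:
        problems.append("denominators differ between ingredients")
    ingredients = [r["ingredient_concept_code"] for r in rows]
    if len(ingredients) != len(set(ingredients)):
        problems.append("same ingredient listed more than once")
    if any(r.get("box_size") == 1 for r in rows):
        problems.append("box_size of 1 should be stored as NULL")
    return problems


# A liquid drug with concentration only: denominator_value stays empty.
syrup = [
    {"ingredient_concept_code": "I1", "numerator_value": 10,
     "numerator_unit": "mg", "denominator_unit": "mL"},
    {"ingredient_concept_code": "I2", "numerator_value": 5,
     "numerator_unit": "mg", "denominator_unit": "mL"},
]
syrup_problems = check_drug(syrup)
```
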
Field Required Type Description
pack_concept_code Yes VARCHAR(50) The source code of the Pack, either Branded or Clinical
drug_concept_code Yes VARCHAR(50) The component drug product in the Pack
amount No SMALLINT The number of units of the drug product in a pack
box_size No SMALLINT The number of packs if the pack is boxed (several packs in a larger container)

The PC_STAGE table contains the composition of a Clinical or Branded Pack: the component Clinical or Branded Drugs and the number of units of each drug in the pack. If it is a boxed Pack, the table will also contain the box size, since Packs, unlike the other drug products, have no records in DS_STAGE. Packs are allowed to have branded drugs as components, although usually a Brand Name is only attributed to the pack as a whole. A Supplier may only be attributed to the pack as a whole.

Box size = 1 should be omitted.


This table contains alternative names for concepts. These are either alternative names provided by the source or names in the original languages, since DRUG_CONCEPT_STAGE will contain English names only.

Field Type Description
synonym_concept_id INTEGER Always an empty field in this table
synonym_name VARCHAR(255) Alternative name of the concept. There is no need to copy the entry from DRUG_CONCEPT_STAGE
synonym_concept_code VARCHAR(50) Concept code in source vocabulary
synonym_vocabulary_id VARCHAR(20) VOCABULARY_ID of source vocabulary
language_concept_id INTEGER CONCEPT_ID for Standard concept representing language

The CONCEPT_RELATIONSHIP_MANUAL table allows source concepts to be mapped manually to existing standard concepts in the OMOP CDM, circumventing the standard vocabulary building process. This is useful to represent source concepts that should be mapped to a concept from a non-drug domain, to preserve relationships inside the vocabulary, or to map concepts to standard Drug concepts from outside of RxNorm and RxNorm Extension logic (e.g. standard concepts in the CVX vocabulary). It is also recommended to provide manual mappings for drugs that may have poor source representation yet are usually of special interest to researchers, like insulins or vaccines.

Field Required Type Description
concept_code_1 Yes VARCHAR(255) CONCEPT_CODE of source concept in either CONCEPT or DRUG_CONCEPT_STAGE tables
concept_code_2 Yes VARCHAR(255) CONCEPT_CODE of target concept in either CONCEPT or DRUG_CONCEPT_STAGE tables
vocabulary_id_1 Yes VARCHAR(20) VOCABULARY_ID value of source concept
vocabulary_id_2 Yes VARCHAR(20) VOCABULARY_ID value of target concept
relationship_id Yes VARCHAR(20) Indicates the type of relation from source to target; most usually will indicate equivalence mapping ('Maps to'). Must be one of the values from RELATIONSHIP table
valid_start_date No DATE Date when the relation became valid
valid_end_date No DATE Date when the relation became invalid
invalid_reason No VARCHAR(1) Non-null entry allows for manual deprecation of existing relationship. Deprecated relationships that are absent from CONCEPT_RELATIONSHIP table will not be added to Standardized Vocabularies

This table does not need to be symmetrical like CONCEPT_RELATIONSHIP; complementary relationships will be built automatically. Note that concepts with equivalence mappings in this table should not have relations to attributes in other input tables.

Quality of input tables

The input tables must meet the following quality requirements:

Rule If rule is violated
Each record should be unique in all tables. The processing will fail.
Concept Codes should be unique and should not repeat for different products. The processing will fail.
Combinations of product components should be unique. These are Ingredient-strength(s) combination, Dose Form, Brand Name, Quantity, Box size. Only the highest Concept Code is retained, and the other ones are treated as non-standard Concepts and mapped to the highest.
Each product should have links (records in INTERNAL_RELATIONSHIP_STAGE) to all their Ingredients. The product will be treated as if it had only the linked Ingredients. If no Ingredients are linked, the product will be processed into the CONCEPT_STAGE table, but as an orphan without any related Concept Classes.
Ingredients should be linked to their Standard counterparts if such concepts exist. These Ingredients are treated as new Standard Ingredients, which may lead to creation of duplicates.
Dose Forms should be linked to their valid counterparts if such concepts exist. These Dose Forms will be treated as new valid Concepts, which may lead to creation of duplicates.
Brand Names should be linked to their valid counterparts if such concepts exist. These Brand Names will be treated as new valid Concepts, which may lead to creation of duplicates.
All % in source dosages should be converted into mg/ml (mg) unless it is a gas. A drug would not be mapped to its Standard Concept.
Marketed Product (a drug that has a relationship to its supplier in INTERNAL_RELATIONSHIP_STAGE) should have both dosage and Dose Form. The product won't be processed into CONCEPT_STAGE table.
Boxed drug should have both dosage and Dose Form. The product won't be processed into CONCEPT_STAGE table.
Product ingredients should match in INTERNAL_RELATIONSHIP_STAGE and DS_STAGE. The processing will fail.
When Ingredients, Dose Forms or other attributes are mapped to multiple targets, precedence values must be present and unique for each source concept. Processing will create orphaned meaningless branches of RxNorm Extension concepts.
Concepts with active 'Maps to' relations inside CONCEPT_RELATIONSHIP_MANUAL should not have any entries indicating relationships with attributes. Redundant branches of RxNorm Extension or multiple mappings may be created.
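
As an example of how one of these rules can be checked, the sketch below finds source concepts whose precedence values are not unique. It is an illustration only and not the project's QA script: the function name precedence_problems, the vocabulary name 'MyVocab' and the concept IDs are hypothetical. A missing precedence is treated as 1, as described for the precedence field.

```python
from collections import defaultdict


def precedence_problems(rows):
    """Return source concepts whose precedence values repeat.

    `rows` are RELATIONSHIP_TO_CONCEPT records given as dictionaries.
    """
    by_source = defaultdict(list)
    for row in rows:
        precedence = row.get("precedence")
        key = (row["concept_code_1"], row["vocabulary_id_1"])
        # A missing precedence is interpreted as precedence 1.
        by_source[key].append(1 if precedence is None else precedence)
    return sorted(
        key for key, values in by_source.items()
        if len(values) != len(set(values))
    )


mappings = [
    # F1 has two targets but no precedence, so both count as 1.
    {"concept_code_1": "F1", "vocabulary_id_1": "MyVocab", "concept_id_2": 1001},
    {"concept_code_1": "F1", "vocabulary_id_1": "MyVocab", "concept_id_2": 1002},
    # F2 is mapped correctly with distinct precedence values.
    {"concept_code_1": "F2", "vocabulary_id_1": "MyVocab", "concept_id_2": 1001,
     "precedence": 1},
    {"concept_code_1": "F2", "vocabulary_id_1": "MyVocab", "concept_id_2": 1002,
     "precedence": 2},
]
offenders = precedence_problems(mappings)
```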

For quality assurance of the input tables you can use the drug_stage_tables_QA.sql script from the project's GitHub repository.

Proposals to add a new vocabulary into the CDM may be submitted (optionally with prepared input tables) as issues on GitHub.

implementation_international_drug_vocabulary.txt · Last modified: 2021/06/09 10:15 by adavydov