3.7: Character and Categorical Data - Biology


Scientific data sets, especially those to be analyzed statistically, commonly contain “categorical” entries. Sometimes these categories, while not numeric, have an intrinsic order, like “low,” “medium,” and “high” dosages.

Sadly, more often than not, these entries are not encoded for easy analysis. Consider the tab-separated file expr_long_coded.txt, where each line represents a (normalized) expression reading for a gene (specified by the ID column) in a given sample group. This experiment tested the effects of a chemical treatment on an agricultural plant species. The sample group encodes information about what genotype was tested (either C6 or L4), the treatment applied to the plants (either control or chemical), the tissue type measured (either A, B, or C for leaf, stem, or root), and numbers for statistical replicates (1, 2, or 3).

Initially, we’ll read the table into a data frame. For this set of data, we’ll likely want to work with the categorical information independently, for example, by extracting only values for the chemical treatment. This would be much easier if the data frame had individual columns for genotype, treatment, tissue, and replicate as opposed to a single, all-encompassing sample column.

A basic installation of R includes a number of functions for working with character vectors, but the stringr package (available via install.packages("stringr") on the interactive console) collects many of these into a set of nicely named functions with common options. For an overview, see help(package = "stringr"), but in this chapter we’ll cover a few of the most important functions from that package.

Splitting and Binding Columns

The str_split_fixed() function from the stringr package operates on each element of a character vector, splitting it into pieces based on a pattern. With this function, we can split each element of the expr_long$sample vector into three pieces based on the pattern "_". The “pattern” could be a regular expression, using the same syntax as used by Python (and similar to that used by sed).

The value returned by the str_split_fixed() function is a matrix: like vectors, matrices can only contain a single data type (in fact, they are vectors with attributes specifying the number of rows and columns), but like data frames they can be accessed with [row_selector, column_selector] syntax. They may also have row and column names.

Anyway, we’ll likely want to convert the matrix into a data frame using the data.frame() function, and assign some reasonable column names to the result.

At this point, we have a data frame expr_long as well as sample_split_df. These two have the same number of rows in a corresponding order, but with different columns. To get these into a single data frame, we can use the cbind() function, which binds such data frames by their columns, and only works if they contain the same number of rows.
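As a minimal sketch of the split, convert, and bind steps, the following uses a tiny invented stand-in for expr_long_coded.txt. The expression values and gene IDs are hypothetical, and the column names genotype, treatment, and tissuerep are assumed from the surrounding text.

```r
library(stringr)

# Hypothetical miniature stand-in for the real table (values invented)
expr_long <- data.frame(
  id = c("G1", "G1", "G2", "G2"),
  expression = c(5.1, 4.8, 7.3, 6.9),
  sample = c("C6_chemical_A1", "C6_control_A1", "L4_chemical_B2", "L4_control_C3"),
  stringsAsFactors = FALSE
)

# Split each sample entry into exactly three pieces on "_";
# the result is a character matrix with one row per input element
sample_split <- str_split_fixed(expr_long$sample, "_", 3)

# Convert the matrix to a data frame and give its columns readable names
sample_split_df <- data.frame(sample_split, stringsAsFactors = FALSE)
colnames(sample_split_df) <- c("genotype", "treatment", "tissuerep")

# Bind the two data frames column-wise (both must have the same number of rows)
expr_long_split <- cbind(expr_long, sample_split_df)
print(head(expr_long_split))
```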

A quick print(head(expr_long_split)) lets us know if we’re headed in the right direction.

At this point, the number of columns in the data frame has grown large, so print() has elected to wrap the final column around in the printed output.

Detecting and %in%

We still don’t have separate columns for tissue and replicate, but we do have this information encoded together in a tissuerep column. Because these values are encoded without a pattern to obviously split on, str_split_fixed() may not be the most straightforward solution.

Although any solution assuming a priori knowledge of large data set contents is dangerous (as extraneous values have ways of creeping into data sets), a quick inspection of the data reveals that the tissue types are encoded as either A, B, or C, with apparently no other possibilities. Similarly, the replicate numbers are 1, 2, and 3.

A handy function in the stringr package detects the presence of a pattern in every entry of a character vector, returning a logical vector. For the column tissuerep containing "A1", "A3", "B1", "B2", "B3", "C1", ..., for example, str_detect(expr_long_split$tissuerep, "A") would return the logical vector TRUE, TRUE, FALSE, FALSE, FALSE, .... Thus we can start by creating a new tissue column, initially filled with NA values.

Then we’ll use selective replacement to fill this column with the value "A" where the tissuerep column has an "A" as identified by str_detect(). Similarly for "B" and "C".
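The detect-and-replace approach might look like the sketch below. The six tissuerep values are the ones quoted above, placed in a small invented data frame for illustration.

```r
library(stringr)

# Small invented data frame holding the tissuerep codes quoted in the text
expr_long_split <- data.frame(
  tissuerep = c("A1", "A3", "B1", "B2", "B3", "C1"),
  stringsAsFactors = FALSE
)

# Start with a column of NA values
expr_long_split$tissue <- NA

# Selective replacement: fill in "A", "B", or "C" wherever the pattern is detected
expr_long_split$tissue[str_detect(expr_long_split$tissuerep, "A")] <- "A"
expr_long_split$tissue[str_detect(expr_long_split$tissuerep, "B")] <- "B"
expr_long_split$tissue[str_detect(expr_long_split$tissuerep, "C")] <- "C"

print(expr_long_split)
```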

In chapter 34, “Reshaping and Joining Data Frames,” we’ll also consider more advanced methods for this sort of pattern-based column splitting. As well, although we’re working with columns of data frames, it’s important to remember that they are still vectors (existing as columns), and that the functions we are demonstrating primarily operate on and return vectors.

If our assumption, that "A", "B", and "C" were the only possible tissue types, was correct, there should be no NA values left in the tissue column. We should verify this assumption by attempting to print all rows where the tissue column is NA (using is.na(), which returns a logical vector).

In this case, a data frame with zero rows is printed. There is a possibility that tissue types like "AA" have been recoded as simple "A" values using this technique. To avoid this outcome, we could use a more restrictive regular expression in the str_detect(), such as "^A\\d$", which will only match elements that start with a single "A" followed by a single digit. See chapter 11, “Patterns (Regular Expressions),” and chapter 21, “Bioinformatics Knick-knacks and Regular Expressions,” for more information on regular-expression patterns.
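To illustrate the difference between a loose and an anchored pattern, here is a short sketch on invented codes that deliberately include an unexpected "AA2" entry. The variable names loose, strict, and tissue are hypothetical.

```r
library(stringr)

# Invented codes, including one unexpected entry ("AA2")
tissuerep <- c("A1", "AA2", "B3", "C1")

# A loose pattern matches any entry that contains an "A" anywhere
loose <- str_detect(tissuerep, "A")

# An anchored pattern requires exactly one "A" followed by exactly one digit
strict <- str_detect(tissuerep, "^A\\d$")

# Fill a tissue vector using only the strict patterns
tissue <- rep(NA, length(tissuerep))
tissue[strict] <- "A"
tissue[str_detect(tissuerep, "^B\\d$")] <- "B"
tissue[str_detect(tissuerep, "^C\\d$")] <- "C"

# Entries that matched none of the strict patterns remain NA and can be inspected
print(tissuerep[is.na(tissue)])
```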

A similar set of commands can be used to fill a new replicate column.

Again we search for leftover NA values, and find that this time there are some rows where the rep column is reported as NA, apparently because a few entries in the table have a replicate number of 0.

There are a few ways we could handle this. First, we could determine what these five samples’ replicate numbers should be; perhaps they were miscoded somehow. Second, we could add "0" as a separate replicate possibility (so a few groups were represented by four replicates, rather than three). Third, we could remove these mystery entries.

Finally, we could remove all measurements for these gene IDs, including the other replicates. For this data set, we’ll opt for this last approach, as the existence of these “mystery” measurements throws into doubt the accuracy of the other measurements, at least for this set of five IDs.

To do this, we’ll first extract a vector of the “bad” gene IDs, using logical selection on the id column based on the rep column.

Now, for each element of the id column, which ones are equal to one of the elements in the bad_ids vector? Fortunately, R provides a %in% operator for this many-versus-many sort of comparison. Given two vectors, %in% returns a logical vector indicating which elements of the left vector match one of the elements in the right. For example, c(3, 2, 5, 1) %in% c(1, 2) returns the logical vector FALSE, TRUE, FALSE, TRUE. This operation requires comparing each of the elements in the left vector against each of the elements of the right vector, so the number of comparisons is roughly the length of the first times the length of the second. If both are very large, such an operation could take quite some time to finish.

Nevertheless, we can use the %in% operator along with logical selection to remove all rows containing a “bad” gene ID.
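A compact sketch of this cleanup, on an invented five-row table in which one gene has a replicate recorded as NA, could look like the following. The names bad_ids and bad_rows follow the text; the data are hypothetical.

```r
# Invented table: gene G2 has one replicate recorded as NA
expr_long_split <- data.frame(
  id = c("G1", "G1", "G2", "G2", "G3"),
  rep = c("1", "2", "1", NA, "3"),
  stringsAsFactors = FALSE
)

# IDs of genes with at least one NA replicate
bad_ids <- expr_long_split$id[is.na(expr_long_split$rep)]

# TRUE for every row whose id matches one of the bad IDs,
# including rows whose own rep value looks fine
bad_rows <- expr_long_split$id %in% bad_ids

# Keep only the rows that are not flagged
expr_long_split <- expr_long_split[!bad_rows, ]
print(expr_long_split)
```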

At this point, we could again check for NA values in the rep column to ensure the data have been cleaned up appropriately. If we wanted, we could also check length(bad_rows[bad_rows]) to see how many bad rows were identified and removed. (Do you see why?)


While above we discussed splitting contents of character vectors into multiple vectors, occasionally we want to do the opposite: join the contents of character vectors together into a single character vector, element by element. The str_c() function from the stringr library accomplishes this task.

The str_c() function is also useful for printing nicely formatted sentences for debugging.

The Base-R function equivalent to str_c() is paste(), but while the default separator for str_c() is an empty string, "", the default separator for paste() is a single space, " ". The equivalent Base-R function for str_detect() is grepl(), and the closest equivalent to str_split_fixed() in Base-R is strsplit(). As mentioned previously, however, using these and other stringr functions for this type of character-vector manipulation is recommended.
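The joining behavior and the difference in default separators can be seen in a brief sketch. The vectors below are invented for illustration.

```r
library(stringr)

genotype <- c("C6", "L4")
treatment <- c("chemical", "control")

# Element-by-element joining; str_c() uses "" as its default separator,
# so the underscore is supplied as an explicit piece
joined <- str_c(genotype, "_", treatment)
print(joined)

# Base-R paste() defaults to a single space, so sep is set to get the same result
joined_base <- paste(genotype, treatment, sep = "_")
print(joined_base)

# Building a sentence for debugging output
print(str_c("Number of samples: ", length(joined)))
```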

By now, factors have been mentioned a few times in passing, mostly in the context of using stringsAsFactors = FALSE, to prevent character vectors from being converted into factor types when data frames are created. Factors are a type of data relatively unique to R, and provide an alternative for storing categorical data compared with simple character vectors.

A good method for understanding factors might be to understand one of the historical reasons for their development, even if the reason is no longer relevant today. How much space would the treatment column of the experimental data frame above require to store in memory, if the storage was done naively? Usually, a single character like "c" can be stored in a single byte (8 bits, depending on the encoding), so an entry like "chemical" would require 8 bytes, and "control" would require 7. Given that there are ~360,000 entries in the full table, the total space required would be ~2.7 megabytes (at roughly 7.5 bytes per entry). Obviously, the amount of space would be greater for a larger table, and decades ago even a few megabytes of data could represent a challenge.

But that’s for a naive encoding of the data. An alternative could be to encode "chemical" and "control" as simple integers 1 and 2 (4 bytes can encode integers from -2.1 to 2.1 billion), as well as a separate lookup table mapping the integer 1 to "chemical" and 2 to "control". This would be a space savings of about two times, or more if the terms were longer. This type of storage and mapping mechanism is exactly what factors provide.[1]

We can convert a character vector (or factor) into a factor using the factor() function, and as usual the head() function can be used to extract the first few elements.

When printed, factors display their levels as well as the individual data elements encoded to levels. Notice that the quote marks usually associated with character vectors are not shown.

It is illustrative to attempt to use the str() and class() and attr() functions to dig into how factors are stored. Are they lists, like the results of the t.test() function, or something else? Unfortunately, they are relatively immune to the str() function; str(treatment_factor) reports only a one-line summary: the number of levels, the level names, and the first few integer codes.

This result illustrates that the data appear to be coded as integers. If we were to run print(class(treatment_factor)), we would discover its class is "factor".

As it turns out, the class of a data type is stored as an attribute.

Above, we learned that we could remove an attribute by setting it to NULL. Let’s set the "class" attribute to NULL, and then run str() on it.

Aha! This operation reveals a factor’s true nature: an integer vector, with an attribute of "levels" storing a character vector of labels, and an attribute for "class" that specifies the class of the vector as a factor. The data itself are stored as either 1 or 2, but the levels attribute has "chemical" as its first element (and hence an integer of 1 encodes "chemical") and "control" as its second (so 2 encodes "control").

This special "class" attribute controls how functions like str() and print() operate on an object, and if we want to change it, this is better done by using the class() accessor function rather than the attr() function as above. Let’s change the class back to factor.
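The exploration described in the last few paragraphs can be reproduced on a toy vector. The four treatment values below are invented, and the variable stripped is a hypothetical working copy so that treatment_factor itself is left untouched.

```r
treatment <- c("chemical", "control", "control", "chemical")  # toy data

# Convert to a factor; levels default to alphabetical order
treatment_factor <- factor(treatment)
print(treatment_factor)

# str() gives a one-line summary; class() and attr() reveal the class attribute
str(treatment_factor)
print(class(treatment_factor))
print(attr(treatment_factor, "class"))

# Removing the class attribute from a copy exposes the underlying integer vector
stripped <- treatment_factor
attr(stripped, "class") <- NULL
str(stripped)

# Restoring the class with the class() accessor turns it back into a factor
class(stripped) <- "factor"
print(stripped)
```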

Renaming Factor Levels

Because levels are stored as an attribute of the data, we can easily change the names of the levels by modifying the attribute. We can do this with the attr() function, but as usual, a specific accessor function called levels() is preferred.

Why is the levels() function preferred over using attr()? Because when using attr(), there would be nothing to stop us from doing something irresponsible, like setting the levels to identical values, as in c("Water", "Water"). The levels() function will check for this and other absurdities.

What the levels() function can’t check for, however, is the semantic meaning of the levels themselves. It would not be a good idea to mix up the names, so that "Chemical" would actually be referring to plants treated with water, and vice versa:

The reason this is a bad idea is that using levels() only modifies the "levels" attribute but does nothing to the underlying integer data, breaking the mapping.
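A small sketch makes the danger concrete. The three-element factor is invented, and renamed and swapped are hypothetical copies used to compare a sensible renaming with a mixed-up one.

```r
treatment_factor <- factor(c("chemical", "control", "control"))  # toy data

# Sensible renaming: new names are listed in the same order as the existing levels
renamed <- treatment_factor
levels(renamed) <- c("Chemical", "Water")
print(renamed)

# Mixed-up renaming: the labels are swapped, but the integer codes are unchanged,
# so chemical-treated entries are now labeled "Water" and vice versa
swapped <- treatment_factor
levels(swapped) <- c("Water", "Chemical")
print(swapped)

# The underlying integers are identical in both cases
print(as.integer(renamed))
print(as.integer(swapped))
```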

Although we motivated factors on the basis of memory savings, in modern versions of R, even character vectors are stored internally using a sophisticated strategy, and modern computers generally have larger stores of RAM besides. Still, there is another motivation for factors: the fact that the levels may have a meaningful order. Some statistical tests might like to compare certain subsets of a data frame as defined by a factor; for example, numeric values associated with factor levels "low" might be compared to those labeled "medium", and those in turn should be compared to values labeled "high". But, given these labels, it makes no sense to compare the "low" readings directly to the "high" readings. Factors provide a way to specify that the data are categorical, but also that "low" < "medium" < "high".

Thus we might like to specify or change the order of the levels within a factor, to say, for example, that the "Water" treatment is somehow less than the "Chemical" treatment. But we can’t do this by just renaming the levels.

The most straightforward way to specify the order of a character vector or factor is to convert it to a factor with factor() (even if it already is a factor) and specify the order of the levels with the optional levels = parameter. Usually, if a specific order is being supplied, we’ll want to specify ordered = TRUE as well. The levels specified by the levels = parameter must match the existing entries. To simultaneously rename the levels, the labels = parameter can be used as well.

Now, "Water" is used in place of "control", and the factor knows that "Water" < "Chemical". If we wished to have "Chemical" < "Water", we would have needed to use levels = c("chemical", "control") and labels = c("Chemical", "Water") in the call to factor().
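On a toy vector (values invented), a call of the kind described here might be sketched as follows.

```r
treatment <- c("chemical", "control", "control", "chemical")  # toy data

# levels = gives the existing entries in the desired order;
# labels = renames them; ordered = TRUE records that the order is meaningful
treatment_factor <- factor(treatment,
                           levels = c("control", "chemical"),
                           labels = c("Water", "Chemical"),
                           ordered = TRUE)
print(treatment_factor)

# Ordered factors support comparisons against a level name
print(treatment_factor < "Chemical")
```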

Disregarding the labels = argument (used only when we want to rename levels while reordering), because the levels = argument takes a character vector of the unique entries in the input vector, these could be precomputed to hold the levels in a given order. Perhaps we’d like to order the tissue types in reverse alphabetical order, for example:

Rather than assigning to a separate tissues_factor variable, we could replace the data frame column with the ordered vector by assigning to expr_long_split$tissue.
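One way to precompute reverse-alphabetical levels is sketched below on invented tissue codes. The helper name unique_sorted_tissues is hypothetical.

```r
tissue <- c("A", "C", "B", "A", "B")  # toy data

# Unique entries, sorted alphabetically and then reversed
unique_sorted_tissues <- rev(sort(unique(tissue)))
print(unique_sorted_tissues)

# Use the precomputed vector as the level order
tissues_factor <- factor(tissue, levels = unique_sorted_tissues, ordered = TRUE)
print(tissues_factor)
```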

We often wish to order the levels of a factor according to some other data. In our example, we might want the “first” tissue type to be the one with the smallest mean expression, and the last to be the one with the highest mean expression. A specialized function, reorder(), makes this sort of ordering quick and relatively painless. It takes three important parameters (among other optional ones):

  1. The factor or character vector to convert to a factor with reordered levels.
  2. A (generally numeric) vector of the same length to use as reordering data.
  3. A function to use in determining what to do with argument 2.

Here’s a quick canonical example. Suppose we have two vectors (or columns in a data frame), one of sampled fish species (“bass,” “salmon,” or “trout”) and another of corresponding weights. Notice that the salmon are generally heavy, the trout are light, and the bass are in between.

If we were to convert the species vector into a factor with factor(species), we would get the default alphabetical ordering: bass, salmon, trout. If we’d prefer to organize the levels according to the mean of the group weights, we can use reorder():

With this assignment, species_factor will be a factor whose levels are in the order trout, bass, salmon. (Given a character vector or unordered factor, reorder() returns a plain factor with reordered levels; adding order = TRUE to the call produces a formally ordered factor with trout < bass < salmon.) This small line of code does quite a lot, actually. It runs the mean() function on each group of weights defined by the different species labels, sorts the results by those means, and uses the corresponding group ordering to set the factor levels. What is even more impressive is that we could have just as easily used median instead of mean, or any other function that operates on a numeric vector to produce a numeric summary. This idea of specifying functions as parameters in other functions is one of the powerful “functional” approaches taken by R, and we’ll be seeing more of it.
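Since the original sample values are not reproduced here, the following sketch uses invented weights chosen so that salmon are heaviest and trout lightest.

```r
# Invented sample: species labels and corresponding weights
species <- c("bass", "salmon", "salmon", "trout", "bass", "trout", "salmon")
weights <- c(2.0, 7.5, 8.1, 0.9, 2.4, 1.2, 6.8)

# Default factor() ordering is alphabetical
print(levels(factor(species)))

# reorder() computes mean(weights) per species and sorts the levels by that value
species_factor <- reorder(species, weights, mean)
print(levels(species_factor))

# Any function producing a single numeric summary can be used instead
species_by_median <- reorder(species, weights, median)
print(levels(species_by_median))
```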

Final Notes about Factors

In many ways, factors work much like character vectors and vice versa; %in% and == can be used to compare elements of factors just as they can with character vectors (e.g., tissues_factor == "A" returns the expected logical vector).

Factors enforce that all elements are treated as one of the levels named in the levels attribute, or NA otherwise. A factor encoding 1 and 2 as male and female, for example, will treat any other underlying integer as NA. To get a factor to accept novel levels, the levels attribute must first be modified with the levels() function.
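The following sketch, on an invented three-element factor, shows what happens when a value outside the existing levels is assigned, and how extending the levels first avoids the problem. The names sex_bad and sex_ok are hypothetical.

```r
sex <- factor(c("male", "female", "male"))  # toy data

# Assigning a value that is not an existing level produces NA
# (R also issues a warning, suppressed here to keep the output clean)
sex_bad <- sex
suppressWarnings(sex_bad[2] <- "unknown")
print(sex_bad)

# Extending the levels first lets the factor accept the new value
sex_ok <- sex
levels(sex_ok) <- c(levels(sex_ok), "unknown")
sex_ok[2] <- "unknown"
print(sex_ok)
```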

Finally, because factors work much like character vectors, but don’t print their quotes, it can be difficult to tell them apart from other types when printed. This goes for simple character vectors when they are part of data frames. Consider the following printout of a data frame:

Because quotation marks are left off when printing data frames, it’s impossible to tell from this simple output that the id column is a character vector, the tissue column is a factor, the count column is an integer vector, and the group column is a factor.[2] Using class() on individual columns of a data frame can be useful for determining what types they actually are.
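As a quick illustration, the invented data frame below has columns of the types just listed; the values themselves are hypothetical.

```r
# Invented data frame with a character, a factor, an integer, and another factor column
df <- data.frame(
  id = c("g1", "g2", "g3"),
  tissue = factor(c("A", "B", "A")),
  count = c(10L, 25L, 7L),
  group = factor(c("x", "y", "x")),
  stringsAsFactors = FALSE
)

# Printed output does not distinguish character columns from factor columns
print(df)

# class() on a single column, or sapply() across all columns, reveals the types
print(class(df$id))
print(sapply(df, class))
```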


  1. In the annotation file PZ.annot.txt, each sequence ID (column 1) may be associated with multiple gene ontology (GO) “numbers” (column 2) and a number of different “terms” (column 3). Many IDs are associated with multiple GO numbers, and there is nothing to stop a particular number or term from being associated with multiple IDs.

    While most of the sequence IDs have an underscore suffix, not all do. Start by reading in this file (columns are tab separated) and then extracting a suffix_only data frame containing just those rows where the sequence ID contains an underscore. Similarly, extract a no_suffix data frame for rows where sequence IDs do not contain an underscore.

    Next, add to the suffix_only data frame columns for base_id and suffix, where base IDs are the parts before the underscore and suffixes are the parts after the underscore (e.g., base_id is "PZ7180000023260" and suffix is "APN" for the ID "PZ7180000023260_APN").

    Finally, produce versions of these two data frames where the GO: prefix has been removed from all entries of the second column.

  2. The line s <- sample(c("0", "1", "2", "3", "4"), size = 100, replace = TRUE) generates a character vector of 100 random "0"s, "1"s, "2"s, "3"s, and "4"s. Suppose that "0" means “Strongly Disagree,” "1" means “Disagree,” "2" means “Neutral,” "3" means “Agree,” and "4" means “Strongly Agree.” Convert s into an ordered factor with levels Strongly Disagree < Disagree < Neutral < Agree < Strongly Agree.
  3. Like vectors, data frames (both rows and columns) can be selected by index number (numeric vector), logical vector, or name (character vector). Suppose that current_grade is a character vector generated by current_grade <- sample(c("A", "B", "C", "D", "E"), size = 100, replace = TRUE), and gpa is a numeric vector, as in gpa <- runif(100, min = 0.0, max = 4.0). Further, we add these as columns to a data frame, grades <- data.frame(current_grade, gpa, stringsAsFactors = FALSE).

    We are interested in pulling out all rows that have "A", "B", or "C" in the current_grade column. Describe, in detail, what each of the three potential solutions does: How does R interpret each one (i.e., what will R try to do for each), and what would the result be? Which one(s) is (are) correct? Which will report errors? Are the following three lines any different from the above three in what R tries to do?

The titanic_data.csv file (available on the course git repository) contains information on 1309 passengers from aboard the Titanic (CSV stands for Comma-Separated Values, a simple plain-text format for storing spreadsheet data). Variables in this data set include gender, age, ticketed class, the passenger’s destination, whether they survived, etc. We’ll use this data set to explore some of the demographics of the passengers who were aboard the ship, and how those demographics relate to whether a passenger survived.

We’ll use this data to explore whether the saying “Women and children first!” applied on the Titanic.

3.5 Basic Plots

R has some basic plotting functions, but they’re difficult to use and aesthetically not very nice. They can be useful to have a quick look at data while you’re working on a script, though. The function plot() usually defaults to a sensible type of plot, depending on whether the arguments x and y are categorical, continuous, or missing.

Figure 3.1: plot() with categorical x

Figure 3.2: plot() with categorical x and continuous y

Figure 3.3: plot() with continuous x and y

The function hist() creates a quick histogram so you can see the distribution of your data. You can adjust how many columns are plotted with the argument breaks.
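A brief sketch of these defaults follows, using invented data rather than the Titanic file. The variables species, weights, and body_length are hypothetical, and a single number passed to breaks is treated by hist() as a suggestion rather than an exact bin count.

```r
set.seed(42)  # make the invented data reproducible

species <- factor(sample(c("bass", "salmon", "trout"), size = 60, replace = TRUE))
weights <- rnorm(60, mean = 3, sd = 1)
body_length <- weights * 10 + rnorm(60)

plot(species)               # categorical x only: bar chart of counts
plot(species, weights)      # categorical x, continuous y: one box plot per category
plot(body_length, weights)  # continuous x and y: scatter plot

# Histogram with a suggested number of bins; the returned object holds the counts
h <- hist(weights, breaks = 20)
print(length(h$counts))
```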

Why does it matter whether a variable is categorical, ordinal or numerical?

Statistical computations and analyses assume that the variables have a specific level of measurement. For example, it would not make sense to compute an average hair color. An average of a nominal variable does not make much sense because there is no intrinsic ordering of the levels of the categories. Moreover, if you tried to compute the average of educational experience as defined in the ordinal section above, you would also obtain a nonsensical result. Because the spacing between the four levels of educational experience is very uneven, the meaning of this average would be very questionable. In short, an average requires a variable to be numerical. Sometimes you have variables that are “in between” ordinal and numerical, for example, a five-point Likert scale with values “strongly agree”, “agree”, “neutral”, “disagree” and “strongly disagree”. If we cannot be sure that the intervals between each of these five values are the same, then we would not be able to say that this is an interval variable, but we would say that it is an ordinal variable. However, in order to be able to use statistics that assume the variable is numerical, we will assume that the intervals are equally spaced.


A logical variable is a variable with only two values, TRUE or FALSE:

It is also possible to transform logical data into numeric data. After the transformation from logical to numeric with the as.numeric() command, FALSE values become 0 and TRUE values become 1:

Conversely, numeric data can be converted to logical data, with FALSE for all values equal to 0 and TRUE for all other values.
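Both directions of conversion can be checked with a few lines; the vectors x and y are invented for illustration.

```r
x <- c(TRUE, FALSE, TRUE)

# Logical to numeric: FALSE becomes 0 and TRUE becomes 1
print(as.numeric(x))

# Because of this coding, sum() counts the TRUE values
print(sum(x))

# Numeric to logical: 0 becomes FALSE and every other value becomes TRUE
y <- c(0, 1, -2.5, 0)
print(as.logical(y))
```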

Thanks for reading. I hope this article helped you to understand the basic data types in R and their particularities. If you would like to learn more about the different variable types from a statistical point of view, read the article “Variable types and examples”.


Taxa, characters and cells can also be selected using items in the Select menu of the Character Matrix Editor. These allow you to select variable characters, to select stretches of sequence matching a currently selected stretch, to reverse the current selection, and perform other changes to the selection.

Searching the matrix

The contents of the matrix can be searched in two ways. First, the Search area at the top of the window can be set to Search Data (set by clicking on the little icon). Then, you can enter a search string and hit return.

Second, you can search the matrix using the Find commands in the Edit menu, and the Select by Search in the Select menu. In the Edit menu, Find String selects the first cell in the matrix containing the given string of text. It searches first the taxon names, then the character names, then the character states within the matrix. You can find subsequent instances using the Find Again command. Find All selects all those cells containing a given string. Find Footnote operates like Find String, except that it highlights cells that contain footnotes with the given text. (To find text within the more elaborate Annotations, you will need to call up an annotations window using the Annotate (pencil) tool in the character matrix editor, then choose Find Annotation in the Notes menu.)


You can copy taxon and character names from one region of the matrix to another and from one matrix to another. You can also copy one or more cells in the matrix to the Clipboard, and paste them into another region of the matrix, or into another matrix. Mesquite allows you to do this with discontinuous selections, as long as the number of cells selected in the first taxon that is selected while copying is the same as the number of cells selected in the first taxon that is selected while pasting, and the same for subsequent taxa. That is, if you select two cells in taxon 3, one cell in taxon 5, and four cells in taxon 7, and copy this to the Clipboard, then when you paste, you must select two cells in the first taxon in which to paste, one in the next, and four in the last.

Mesquite will not let you paste a block of cells into the matrix while you have selected a differently shaped block in the matrix. However, if you attempt to do that, Mesquite will offer to change the selection so that it covers the same number of cells as the copied block. You may then attempt again to paste.

Editing Names of Characters and States

Character names can be assigned either by editing the column headings in the Character Matrix Editor, or by editing the character names directly in the List of Characters window.

The State Names Editor, available by choosing (Character Matrix) Matrix>Edit State Names, allows you to name the states of categorical characters. It will not be available if your matrix is specified as nucleotide or protein data. You can change the orientation (states by characters, or characters by states) of the State Names Editor by touching the double arrow at the top left of the window. Footnotes can be attached to particular states by selecting the state and typing the footnote in the annotation area at the bottom of the window.


You can annotate the character matrix by attaching simple footnotes, more elaborate annotations, or colors to the taxa, characters or cells of the matrix. Simple footnotes can be attached by selecting the cell with the arrow or I-beam tool, then going to the white annotation area at the bottom of the window and entering the footnote. More elaborate annotations and colors can be attached using the Annotations Panel, available by selecting Show Annotations Panel in the Matrix menu. Currently the footnotes and annotations systems are separate in Mesquite: the footnotes appear in the annotation area at the bottom of many windows, while the elaborate Annotations appear in a panel embedded in the Character Matrix Editor. An example data file with annotations is at Mesquite_Folder/examples/Basic_Examples/characters/11a-annotations.nex

The annotations panel appears at the right side of the window, as follows:

The annotations panel above shows the annotations (if any) associated with a given taxon, character or cell in the matrix (depending on what was selected with the annotation (pencil) tool). Notes can be added or deleted using the (+) or trash buttons at the top left. One image can be added to each note, and labels can be added to the images using the I-beam cursor. To control appearance of these labels, right click or control-click on the label to get a drop down menu adjusting the font, color and other properties of the label. The label's pointer can be relocated using the Adjust Pointer tool. Other features of the annotations can be accessed using the Annotations submenu of the Matrix menu. You can search for annotations containing text using the Find Annotation menu item of the Annotations submenu.

Coloring cells of the matrix

Cells or their text can be colored. The default background cell color is chosen in the (Character Matrix) Matrix>Background Color submenu. Colors to distinguish different cells can be specified using the items in the (Character Matrix) Matrix>Color Cells and (Character Matrix) Matrix>Color Text submenus. These submenus specify the color to be used for the background of the cell, or for the text within the cell, according to the following criteria:

  • Character value: A cell is colored according to a value for the entire character, such as parsimony character steps.
  • Cell value: A cell is colored according to a value for that particular cell. For instance, with DNA sequence data, the cells can be colored blue if the site is G or C, white if A or T. By selecting (Character Matrix) Matrix>Moving Window (for colors), you can set the size of the moving window over which GC content is averaged. Other cell values are available for amino acid properties (e.g., hydrophobicity).
  • Excluded: A cell is colored gray if its character is excluded.
  • Footnote present: A cell is colored green if it has a footnote.
  • Character State: A cell is colored according to the character state (e.g., different colors for A, C, G, T).
  • Annotation attached: A cell is colored green if annotations (not footnotes, but the full complex annotations) are attached to it.
  • Assigned Colors: A cell is shown with color as assigned by the paintbrush tool. To assign a color to a cell, click on the cell with the paintbrush. Touch and hold the button in the tool palette to obtain a menu to select the color used, remove colors or color all selected cells.

When cells are colored, you may request a legend for the colors by selecting Show Color Legend in the Matrix menu, or by touching the small button at lower left of the Matrix Editor (beneath the taxon names). If you double click on a color in the matrix, the editor will move to a cell with that color.

Alterations and Transformations

The following are available in the Alter/Transform menu of the Matrix menu to modify the cells or characters of a matrix:

  • Filling selected cells with a specified state: Choose (Character Matrix) Matrix>Alter/Transform>Fill
  • Filling selected cells with random states with equal frequency for all states: Choose (Character Matrix) Matrix>Alter/Transform>Random Fill
  • Randomly reshuffle the states within a character among the selected taxa: Choose (Character Matrix) Matrix>Alter/Transform>Shuffle states among taxa
  • For nucleotide sequence data, convert the entries in each cell into their complement: Choose (Character Matrix) Matrix>Alter/Transform>Nucleotide complement
  • Reversing a selected molecular sequence: Choose (Character Matrix) Matrix>Alter/Transform>Reverse Sequence
  • Removing characters consisting of nothing but gaps: Choose (Character Matrix) Matrix>Alter/Transform>Remove Gaps-Only Characters
  • Removing invariant characters: Choose (Character Matrix) Matrix>Alter/Transform>Remove Invariant Characters

Changing the attributes of characters

A character, in addition to having states assigned in each of the terminal taxa, may also have other attributes. For instance, a character is marked as included or excluded, and it has assumptions attributed to it, such as a weight and a parsimony model of evolution. These attributes are used in various calculations. They may be assigned in the List of Characters windows, available in the Characters menu.

In the List of Characters window, columns refer to inclusion, parsimony model and probability model (for likelihood calculations). Other columns can be requested for Group Membership, Weight, and (for DNA data) Codon Position. You can ask to show a column using the Columns menu. For each of these columns, the assigned attribute can be changed by first either selecting the characters to be changed (if only some characters are to be altered) or selecting the attribute's column (if all characters are to be altered). Then, by touching the name of the column (where an inverted black triangle should appear), a drop down menu appears that allows you to make the appropriate specification.

For each of the attributes other than group membership, the bottom three menu items are to store to the file the current specification as a named specification set (like saving a typeset or weightset in MacClade), to replace an existing specification set with the current one, or to load a stored specification set to become the current.

  • Inclusion - Include, Exclude, and Reverse allow you to change character inclusion. Reverse changes excluded to included and vice versa. Characters that are excluded don't participate in treelength and many other calculations. Exclusion is not universally respected, however: some calculations use even characters that are excluded.
  • Parsimony model - The Model submenu allows you to select a parsimony model to assign to the characters, for use in parsimony calculations.
  • Probability model - The Model submenu allows you to select a probability model to assign to the characters, for use in likelihood calculations and simulations.
  • Group Membership - The image above shows Group Membership in the last column. To assign characters to groups, you must first create groups using New Group. For instance, you could create a group Adult Characters and another group Larval Characters. Then, you can assign characters to a group using the Set Group submenu. You can also edit the color of the group, and rename the group. The group color is useful to distinguish characters of different groups, for instance in charts or in the Character Matrix editor.
  • Weight - With the Set weight menu item you can set the weight assigned to the character. This is currently used only in treelength calculations.
  • Codon Positions — This column is available for DNA data. The drop down menu allows you to assign positions.

Charting information about Characters

Informative statistics and values for characters can be viewed or charted in various windows. In the Analysis menu a Bar & Line Chart can be requested to show the distribution of a value for a series of characters. For instance, the number of parsimony steps in the characters on a current tree can be charted. The Scattergram available in the Analysis menu plots characters in a two-dimensional space with the X axis being one particular value (e.g., the character's likelihood under one tree) and the Y axis another value (e.g., the character's likelihood under a different tree). Values for characters can also be viewed in the List of Characters window, where columns can be added (in the List menu) to show selected statistics for each of the characters.

Listing all character matrices

The List of Character Matrices menu item under Characters brings up a list of character matrices. Here you can view statistics and change names of character matrices.

Copyright © 2002-2011 by Wayne P. Maddison and David R. Maddison.
All rights reserved.

SAS Data Object

The SASdata object is a reference to a SAS Data Set or View. It is used to access data that exists in the SAS session. You create a SASdata object by using the sasdata() method of the SASsession object.

Parameters for the sasdata() method of the SASsession object are:

  • table – [Required] the name of the SAS Data Set or View
  • libref – [Defaults to WORK] the libref for the SAS Data Set or View.
  • results – format of results, SASsession.results is default, PANDAS, HTML or TEXT are the alternatives
  • dsopts – a dictionary containing any of the following SAS data set options (where, drop, keep, obs, firstobs, format, encoding):

  • where is a string
  • keep is a string or a list of strings
  • drop is a string or a list of strings
  • obs is a number, given as either a string or an int
  • firstobs is a number, given as either a string or an int
  • format is a string or a dictionary
  • encoding is a string

Copy the table to itself, or to the ‘out=’ table, and add any new variables you want.

  • vars – REQUIRED dictionary of variable names (keys) and assignment statements (values). To maintain variable order, use collections.OrderedDict. Assignment statements must be valid SAS assignment expressions.
  • out – OPTIONAL takes a SASdata Object you create ahead of time. If not specified, replaces the existing table and the current SAS data object still refers to the replacement table.
  • kwargs

SAS Log showing what happened

  1. cars = sas.sasdata('cars', 'sashelp')
  2. wkcars = sas.sasdata('cars')
  3. cars.add_vars({'PW_ratio': 'weight / horsepower', 'Overhang': 'length - wheelbase'}, wkcars)
  4. wkcars.head()

Append ‘data’ to this SAS Data Set. data can be either another SASdata object or a Pandas DataFrame, in which case dataframe2sasdata(data) will be run for you to load the data into a SAS Data Set, which will then be appended to this data set.

  • data – Either a SASdata object or a Pandas DataFrame
  • force – boolean to force the append even if anomalies exist which could cause dropping or truncating

This method will calculate assessment measures using the SAS AA_Model_Eval Macro used for SAS Enterprise Miner. Not all datasets can be assessed. This is designed for scored data that includes target and prediction columns.

  • target – string that represents the target variable in the data
  • prediction – string that represents the numeric prediction column in the data. For nominal targets this should be a probability in (0, 1).
  • nominal – boolean to indicate if the Target Variable is nominal because the assessment measures are different.
  • event – string which indicates which value of the nominal target variable is the event vs non-event
  • kwargs

This method requires a character column (use the contents method to see column types) and generates a bar chart.

  • var – the CHAR variable (column) you want to plot
  • title – an optional title for the chart
  • label – LegendLABEL= value for sgplot

display metadata about the table, size, number of rows, columns and their data type

Display metadata about the table: size, number of rows, columns and their data types. You can override the format of the output with the results= option for this invocation.

delete ( quiet=False )

Delete this data set. The SASdata object is still available.

Returns: SASLOG for this step

describe ( )

Display descriptive (summary) statistics for the table.

display the first n rows of a table

Parameters: obs – the number of rows of the table that you want to display. The default is 5

heatmap ( x: str, y: str, options: str = '', title: str = '', label: str = '' ) → object

  • x – x variable
  • y – y variable
  • options – display options (string)
  • title – graph title
  • label

This method requires a numeric column (use the contents method to see column types) and generates a histogram.

  • var – the NUMERIC variable (column) you want to plot
  • title – an optional Title for the chart
  • label – LegendLABEL= value for sgplot

Imputes missing values for a SASdata object.

  • vars – a dictionary in the form of {'varname': 'impute type'} or
  • replace
  • prefix
  • out

Display the column info on a SAS data object

Returns: Pandas data frame

means ( )

display descriptive statistics for the table summary statistics. This is an alias for ‘describe’

modify ( formats: dict = None, informats: dict = None, label: str = None, renamevars: dict = None, labelvars: dict = None )

Modify a table, setting formats, informats or changing the data set name itself or renaming variables or adding labels to variables

  • formats – dict of variable names and formats to assign
  • informats – dict of variable names and informats to assign
  • label – string of the label to assign to the data set; if it requires outer quotes, provide them
  • renamevars – dict of variable names and new names to rename the variables
  • labelvars – dict of variable names and labels to assign to them; if any labels require outer quotes, provide them

return the number of observations for your SASdata object

partition ( var: str = '', fraction: float = 0.7, seed: int = 9878, kfold: int = 1, out: typing.Union[saspy.sasdata.SASdata, NoneType] = None, singleOut: bool = True ) → object

Partition a SAS data object using simple random sampling (SRS) or, if a variable is specified, stratified sampling with respect to that variable.

  • var – variable(s) for stratification. If multiple then space delimited list
  • fraction – fraction to split
  • seed – random seed
  • kfold – number of k folds
  • out – the SAS data object
  • singleOut – boolean to return a single table or separate tables

Tuples or SAS data object
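The idea behind a stratified partition can be sketched in plain Python. This is not how saspy or SAS implements partition(): the function name partition_rows, the list-of-dicts input, and the rounding rule are assumptions made for illustration. Only the defaults fraction=0.7 and seed=9878 are taken from the signature above.

```python
import random
from collections import defaultdict


def partition_rows(rows, fraction=0.7, seed=9878, var=None):
    """Split rows into two lists; if var is given, split within each level of var."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        # With no stratification variable, every row falls into one stratum.
        key = row[var] if var is not None else None
        strata[key].append(row)

    first, second = [], []
    for members in strata.values():
        members = members[:]          # copy so the caller's data is not reordered
        rng.shuffle(members)
        cut = round(len(members) * fraction)
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second
```

Stratifying keeps the proportion of each level roughly equal in both parts, which a simple random split does not guarantee for small tables.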

Parameters: name – new name for this data set
Returns: SASLOG for this step

scatter ( x: str, y: list, title: str = '' ) → object

This method plots a scatter of x,y coordinates. You can provide a list of y columns for multiple line plots.

  • x – the x axis variable generally a time or continuous variable.
  • y – the y axis variable(s), you can specify a single column or a list of columns
  • title – an optional Title for the chart

This method is meant to update a SAS Data object with a model score file.

  • file – a file reference to the SAS score code
  • code – a string of the valid SAS score code
  • out – Where to write the file. Defaults to update in place

The Scored SAS Data object.

This method plots a series of x,y coordinates. You can provide a list of y columns for multiple line plots.

  • x – the x axis variable generally a time or continuous variable.
  • y – the y axis variable(s), you can specify a single column or a list of columns
  • title – an optional Title for the chart

This method sets the results attribute for the SASdata object; it stays in effect until changed. It sets the default result type for this SASdata object: ‘Pandas’, ‘HTML’ or ‘TEXT’.

Parameters: results – format of results, SASsession.results is default, PANDAS, HTML or TEXT are the alternatives

sort ( by: str, out: object = '', **kwargs ) → saspy.sasdata.SASdata

  • by – REQUIRED variable to sort by (BY <DESCENDING> variable-1 <<DESCENDING> variable-2 ...>)
  • out – OPTIONAL takes either a string ‘libref.table’ or ‘table’, which will go to WORK or USER if assigned, or a SAS data object; '' will sort in place if allowed
  • kwargs

SASdata object if out= not specified, or a new SASdata object for out= when specified

  1. wkcars.sort('type')
  2. wkcars2 = sas.sasdata('cars2')
  3. wkcars.sort('cylinders', wkcars2)
  4. cars2 = cars.sort('DESCENDING origin', out='foobar')
  5. cars.sort('type').head()
  6. stat_results = stat.reg(model='horsepower = Cylinders EngineSize', by='type', data=wkcars.sort('type'))
  7. stat_results2 = stat.reg(model='horsepower = Cylinders EngineSize', by='type', data=wkcars.sort('type', ''))

display the last n rows of a table

Parameters: obs – the number of rows of the table that you want to display. The default is 5

to_csv ( file: str, opts: dict = None ) → str

This method will export a SAS Data Set to a file in CSV format.

  • file – the OS filesystem path of the file to be created (exported from this SAS Data Set)
  • opts – a dictionary containing any of the following Proc Export options (delimiter, putnames)

Export this SAS Data Set to a Pandas Data Frame

  • method – defaults to MEMORY. As of V3.7.0, all 3 of these now stream directly into read_csv() with no disk I/O and have much improved performance. MEMORY, the default, is now as fast as the others.

  • MEMORY the original method. Streams the data over and builds the dataframe on the fly in memory
  • CSV uses an intermediary Proc Export csv file and pandas read_csv() to import it faster for large data
  • DISK uses the original (MEMORY) method, but persists to disk and uses pandas read to import. this has better support than CSV for embedded delimiters (commas), nulls, CR/LF that CSV has problems with

For the CSV and DISK methods, the following 2 parameters are also available:

  • tempfile – [deprecated except for Local IOM] [optional] an OS path for a file to use for the local file; the default is a temporary file that’s cleaned up
  • tempkeep – [deprecated except for Local IOM] if you specify your own file to use with tempfile=, this controls whether it’s cleaned up after using it

For the MEMORY and DISK methods the following 4 parameters are also available, depending upon access method

  • rowsep – the row separator character to use; defaults to hex(1)
  • colsep – the column separator character to use; defaults to hex(2)
  • rowrep – the char to convert to for any embedded rowsep chars, defaults to ‘ ‘
  • colrep – the char to convert to for any embedded colsep chars, defaults to ‘ ‘
  • kwargs – a dictionary. These vary per access method, and are generally NOT needed. They are either access method specific parms or specific pandas parms. See the specific sasdata2dataframe* method in the access method for valid possibilities.
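To see why unusual separator characters are safer than commas and newlines, here is a small self-contained Python sketch. It is not saspy's transfer code. It assumes that "hex(1)" and "hex(2)" mean the control characters with codes 0x01 and 0x02, and the encode_rows and decode_rows helpers are hypothetical names for this example.

```python
ROWSEP = "\x01"  # assumed meaning of the documented default hex(1)
COLSEP = "\x02"  # assumed meaning of the documented default hex(2)


def encode_rows(rows, rowsep=ROWSEP, colsep=COLSEP, rowrep=" ", colrep=" "):
    """Join cells and rows with separators, replacing any embedded separators."""
    lines = []
    for row in rows:
        cells = [str(c).replace(rowsep, rowrep).replace(colsep, colrep) for c in row]
        lines.append(colsep.join(cells))
    return rowsep.join(lines)


def decode_rows(stream, rowsep=ROWSEP, colsep=COLSEP):
    """Split a stream back into a list of rows, each a list of cell strings."""
    return [line.split(colsep) for line in stream.split(rowsep)]
```

Because commas and line breaks are not separators in this scheme, cell values that contain them pass through untouched; only embedded separator characters are replaced, here by a space as with rowrep and colrep.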

This is an alias for ‘to_df’ specifying method='CSV'.

  • tempfile – [deprecated except for Local IOM] [optional] an OS path for a file to use for the local CSV file; the default is a temporary file that’s cleaned up
  • tempkeep – [deprecated except for Local IOM] if you specify your own file to use with tempfile=, this controls whether it’s cleaned up after using it
  • opts – a dictionary containing any of the following Proc Export options (delimiter, putnames)

This is an alias for ‘to_df’ specifying method='DISK'.

  • tempfile – [deprecated] [optional] an OS path for a file to use for the local file; the default is a temporary file that’s cleaned up
  • tempkeep – [deprecated] if you specify your own file to use with tempfile=, this controls whether it’s cleaned up after using it
  • rowsep – the row separator character to use; defaults to hex(1)
  • colsep – the column separator character to use; defaults to hex(2)
  • rowrep – the char to convert to for any embedded rowsep chars, defaults to ‘ ‘
  • colrep – the char to convert to for any embedded colsep chars, defaults to ‘ ‘
  • kwargs – a dictionary. These vary per access method, and are generally NOT needed. They are either access method specific parms or specific pandas parms. See the specific sasdata2dataframe* method in the access method for valid possibilities.

This is just an alias for to_df()

Returns: Pandas data frame
Return type: ‘pd.DataFrame’

to_json ( pretty: bool = False, sastag: bool = False, **kwargs ) → str

  • pretty – boolean; False returns JSON on one line, True returns formatted JSON
  • sastag – include SAS meta tags
  • kwargs

Return the most commonly occurring items (levels)

  • var – the CHAR variable (column) you want to count
  • n – the top N to be displayed (defaults to 10)
  • order – defaults to most common first; use order='data' to get them in alphabetic order
  • title – an optional Title for the chart
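The counting that this method describes can be illustrated in plain Python with collections.Counter. This is only a sketch of the concept, not the SAS code the method runs; the function name top_levels and the sample values are invented for the example, with n=10 mirroring the documented default.

```python
from collections import Counter


def top_levels(values, n=10):
    """Return the n most frequent values as (value, count) pairs, most common first."""
    return Counter(values).most_common(n)


# Hypothetical character column with three levels.
car_types = ["SUV", "Sedan", "SUV", "Truck", "Sedan", "SUV"]
top_two = top_levels(car_types, n=2)
```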

This method returns a clone of the SASdata object, with the where attribute set. The original SASdata object is not affected.

Parameters: where – the where clause to apply
Returns: SAS data object

In this and future materials to be posted on the course website you’ll encounter blocks of R code. Your natural intuition will be to cut and paste commands or code blocks into the R interpreter to save yourself the typing. DO NOT DO THIS!!

In each of the examples below, I provide example input, but I don’t show you the output. It’s your job to type in these examples at the R console, evaluate what you typed, and to look at and think critically about the output. You will make mistakes and generate errors! Part of learning any new skill is making mistakes, figuring out where you went wrong, and correcting those mistakes. In the process of fixing those errors, you’ll learn more about how R works, and how to avoid such errors, or correct bugs in your own code in the future. If you cut and paste the examples into the R interpreter the code will run, but you will learn less than if you input the code yourself, and you’ll be less capable of applying the concepts in new situations.

The R interpreter, like any programming language interpreter, is very exacting. A misspelled variable or function name, a stray period, or an unbalanced parenthesis will raise an error, and finding the sources of such errors can sometimes be tedious and frustrating. Persist! If you read your code critically, think about what you’re doing, and seek help when needed (teaching team, R help, Google, etc.) you’ll eventually get better at debugging your code. But keep in mind that, like most new skills, learning to write and debug your code efficiently takes time and experience.

Clark Lab

Ecological attributes include species abundances, traits, and individual condition (e.g., growth or infection status), to name a few. They are multivariate data, but not all of one type. They can be combinations of presence-absence, ordinal, continuous, discrete, composition, or zero-inflated. gjam provides inference on sensitivity to input variables, correlations between responses, model selection, prediction of responses, inverse prediction of predictors, and community classification by response to predictors.

gjam was motivated by species distribution and abundance data, but can provide an attractive alternative to traditional methods wherever observations are multivariate and combine multiple scales and mixtures of continuous and discrete data.

Importantly, analysis is done on the observation scale. That is, coefficients and covariances are interpreted on the same scale as the data. This contrasts with standard generalized linear models (GLMs), where coefficients and covariances are difficult to interpret and cannot be compared across responses that are modeled on different scales and with nonlinear link functions.

gjam accommodates massive zeros in multivariate data by avoiding the standard mixtures used in zero-inflated GLMs. Instead, gjam relies on censoring.

gjam exploits censoring to combine multiple data types in a single model, including mixtures of continuous and discrete data. For example, the microbial community (composition data) might be tracked together with host condition (continuous, categorical, binary, ordinal, …).


Clark, J.S., D. Nemergut, B. Seyednasrollah, P. Turner, and S. Zhang. 2017. Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data. Ecological Monographs, 87, 34–56.

Presents the motivation and model, and summarizes computation in gjam. The Supplement file provides additional detail on algorithms.

Taylor-Rodríguez, D., Kaufeld, K., Schliep, E. M., Clark, J. S., Gelfand, A. E. 2017. Joint species distribution modeling: Dimension reduction using Dirichlet processes. Bayesian Analysis, doi: 10.1214/16-BA1031.

Many applications require large numbers of response variables. Microbiome studies bring the additional complication of composition data. And most observed values can still be zero. This paper describes the Dirichlet process prior implemented in gjam that finds a low-dimensional representation for the covariance between responses.

Dynamic species interactions:

Clark, J. S., C. L. Scher, and M. Swift. 2020. The emergent interactions that govern biodiversity change. Proceedings of the National Academy of Sciences, 202003852.

Quantifying species interactions requires dynamic data and the integration of environmental effects. This paper defines and estimates environment-species interactions (ESI) with full uncertainty.

Microbiome applications:

Wang, Z., D. L. Juarez, J.-F. Pan, S. K. Blinebry, J. Gronniger, J. S. Clark, Z. I. Johnson, and D. E. Hunt. 2019. Microbial communities across nearshore to offshore coastal transects are shaped by both distance from shore and seasonality. Environmental Microbiology, 21, 3862-3872.

16S rRNA gene libraries reveal distinct nearshore, continental shelf, and offshore oceanic communities. Water temperature and distance from shore most influence community composition. However, at the phylotype level, the distribution of some taxa is linked to temperature, others to distance from shore, and some to both.

Bachelot B., Uriarte M., Muscarella R., Forero-Montana J., Thompson J., McGuire K., Zimmerman J.K., Swenson N.G. and J.S. Clark. 2018. Associations among arbuscular mycorrhizal fungi and seedlings are predicted to change with tree successional status. Ecology 99: 607-620.

Seedlings of early‐successional tree species may not rely as much as mid‐ and late‐successional species on AM fungi, and AM fungi may accelerate forest succession.

Trait analysis:

Clark, J.S. 2016. Why species tell us more about traits than traits tell us about species: Predictive models. Ecology, 97, 1979–1993.

The joint distribution of ecological attributes (‘traits’) can be modeled together with species, separately, or predicted from the joint distribution of species. This paper describes the model and computation implemented in gjam.

Vignette with R code and applications: gjam vignette

Below are cluster plots of the correlation matrix for a presence-absence model (a), continuous abundance model (b), and the response to environmental variables (d). The cluster analysis in (c) is based on distances in (d). These plots are obtained by specifying GRIDPLOTS=T in gjamPlot.

Main contributors:

Jim Clark wrote the GJAM model, the R and C++ code, and the GJAM package.

Alan Gelfand and Daniel Taylor-Rodríguez wrote the Dirichlet process model and algorithms for dimension reduction.

Daniel Taylor-Rodríguez implemented the Dirichlet process in R and C++ in GJAM.

Bene Bachelot, Chase Nuñez, and Brad Tomasek provided extensive testing and feedback through all stages of development.

Many others: Students in the course Bayesian Inference Environm Models (BIO/ENV 665) at Duke University and members of the Multivariate Modeling working group of the SAMSI Ecology program contributed many ideas, recommendations, and feedback.

Installation in R or RStudio:

Publications using GJAM

Whitworth, A., Beirne, C., Flatt, E., Froese, G., Nuñez, C. and Forsyth, A. (2021), Recovery of dung beetle biodiversity and traits in a regenerating rainforest: a case study from Costa Rica’s Osa Peninsula. Insect Conservation and Diversity.

Clark, J. S., C. L. Scher, and M. Swift. 2020. The emergent interactions that govern biodiversity change. Proceedings of the National Academy of Sciences, 117 (29) 17074-17083. DOI: 10.1073/pnas.2003852117.

Wang, Z., D. L. Juarez, J.-F. Pan, S. K. Blinebry, J. Gronniger, J. S. Clark, Z. I. Johnson, and D. E. Hunt. 2019. Microbial communities across nearshore to offshore coastal transects are shaped by both distance from shore and seasonality. Environmental Microbiology, 21(10):3862-3872. doi: 10.1111/1462-2920.14734.

3.11 Data transformations

Plots in which most points are huddled up in one area, with much of the available space sparsely populated, are difficult to read. If the histogram of the marginal distribution of a variable has a sharp peak and then long tails to one or both sides, transforming the data can be helpful. These considerations apply both to x and y aesthetics, and to color scales. In the plots of this chapter that involved the microarray data, we used the logarithmic transformation (implicitly, since the data in the ExpressionSet object x already come log-transformed), not only in scatterplots like Figure 3.25 for the \(x\)- and \(y\)-coordinates, but also in Figure 3.38 for the color scale that represents the expression fold changes. The logarithm transformation is attractive because it has a definitive meaning: a move up or down by the same amount on a log-transformed scale corresponds to the same multiplicative change on the original scale, \(\log(ax)=\log a+\log x\).
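The identity can be checked numerically. The short Python sketch below uses made-up values and base-2 logarithms (a choice made for this example): multiplying each value by the same factor a shifts every point by the same amount on the log scale, regardless of the starting value.

```python
import math

# Illustrative values spanning several orders of magnitude.
x_values = [1.0, 10.0, 250.0]
a = 2.0  # a two-fold change

# On the log2 scale, multiplying by a adds log2(a) to every value.
shifts = [math.log2(a * x) - math.log2(x) for x in x_values]
```

Each entry of shifts equals log2(a) = 1 up to floating-point rounding, so a two-fold change moves a point by one unit whether it starts at 1 or at 250.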

Sometimes, however, the logarithm is not good enough, for instance when the data include zero or negative values, or when even on the logarithmic scale the data distribution is highly uneven. From the upper panel of Figure 3.40, it is easy to take away the impression that the distribution of M depends on A, with higher variances for lower A. However, this is entirely a visual artefact, as the lower panel confirms: the distribution of M is independent of A, and the apparent trend we saw in the upper panel was caused by the higher point density at smaller A.

Figure 3.40: The effect of rank transformation on the visual perception of dependency.
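A rank transformation replaces each value by its position in the sorted data, so the transformed values are evenly spaced however uneven the original distribution was. The following Python sketch shows the idea. The function name rank_transform and the sample values are invented for this example, ties are broken by original order (an arbitrary choice), and it is presumably the A values that are rank-transformed in the lower panel of the figure.

```python
def rank_transform(values):
    """Return the 1-based rank of each value; ties are broken by original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


# Skewed, made-up data: three small values and one very large one.
skewed = [0.1, 100.0, 0.2, 0.15]
ranks = rank_transform(skewed)
```

After the transformation the large value sits just one step above its neighbor, so the point density along that axis is uniform.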

Can the visual artefact be avoided by using a density- or binning-based plotting method, as in Figure 3.29?