
References and Objects

Rob Halgren (From Steve Rozen, 2001)

Genome Informatics

Make Sure You Understand Arrays and Hashes Before Attempting These Notes!!

Suggested Reading

Christiansen and Torkington, Perl Cookbook, Chapter 11, "References and Records" (and, for the self-destructive and foolhardy, Section 13.13, "Coping with Circular Data Structures").

The perlref man page.
The perlobj man page.

Lecture Notes


What Good Are References?

Sometimes you need a more complex data structure!
Examples:

  • An array of arrays (can do the job of a 2-dimensional matrix).
    DATA:
    Spot_num Ch1-BKGD CH1 Ch2-BKGD Ch2
    000 0.124 43.2 0.102 80.4
    001 0.113 60.7 0.091 22.6
    002 0.084 112.2 0.144 35.3

    CODE
    my @spotarray = ([0.124, 43.2, 0.102, 80.4], [0.113, 60.7, 0.091, 22.6], [0.084, 112.2, 0.144, 35.3]);

  • A hash of arrays
    Accession Ch1-BKGD CH1 Ch2-BKGD Ch2
    AW10021 0.124 43.2 0.102 80.4
    BE52002 0.113 60.7 0.091 22.6
    W20209 0.084 112.2 0.144 35.3

    CODE
    my %spothash = ('AW10021' => [0.124, 43.2, 0.102, 80.4], 'BE52002' => [0.113, 60.7, 0.091, 22.6], 'W20209' => [0.084, 112.2, 0.144, 35.3]);

  • Hashes of hashes, and other even more complex data structures
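
As a rough sketch of how such structures are read back, the short program below builds the three kinds of structure from the spot data above and pulls out single values. The element-access syntax is explained later in these notes; the hash-of-hashes key names (ch1, ch2, and so on) are made up for this illustration.

```perl
use strict;
use warnings;

# Array of arrays: each row is an anonymous array (an array reference).
my @spotarray = (
    [0.124, 43.2,  0.102, 80.4],
    [0.113, 60.7,  0.091, 22.6],
    [0.084, 112.2, 0.144, 35.3],
);

# Row 1, column 1 (both zero-based): the CH1 value of spot 001.
my $ch1_spot_001 = $spotarray[1][1];

# Hash of arrays: look a row up by accession instead of by position.
my %spothash = (
    'AW10021' => [0.124, 43.2,  0.102, 80.4],
    'BE52002' => [0.113, 60.7,  0.091, 22.6],
    'W20209'  => [0.084, 112.2, 0.144, 35.3],
);
my $ch2_aw10021 = $spothash{'AW10021'}[3];

# Hash of hashes: name the columns too (key names are illustrative only).
my %spots = (
    'AW10021' => { ch1_bkgd => 0.124, ch1 => 43.2, ch2_bkgd => 0.102, ch2 => 80.4 },
);
my $ratio = $spots{'AW10021'}{ch2} / $spots{'AW10021'}{ch1};

print "$ch1_spot_001 $ch2_aw10021\n";   # prints: 60.7 80.4
printf "%.2f\n", $ratio;                # prints: 1.86
```

The hash of hashes costs a little more typing, but you no longer have to remember which column number holds which channel.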

What Is A Reference?

Well, first, what is a variable?

Think of a variable as a (named) box that holds a value. The name of the box is the name of the variable. After

$x = 1;

we have

      +---+
$x:   | 1 |
      +---+
After
@y = (1, 'a', 23);
we have
                   +---------------+
               @y: | (1, 'a',  23) |
                   +---------------+

Making References To Variables' Values

$list_ref = \@array;
$map_ref  = \%hash;
$c_ref    = \$count;

Refs to subroutines

$sub_ref = \&subroutine;

A reference is an additional, rather different way to name the variable. After

$ref_to_y = \@y;
we have
                     +---------------+
             +-> @y: | (1, 'a',  23) |
             |       +---------------+
             |
             |
           +-|-+
$ref_to_y: | * |
           +---+
$ref_to_y contains a reference (pointer) to @y.

print @y yields 1a23 and print $ref_to_y yields something like ARRAY(0x80cd6ac) (the hexadecimal address will differ from run to run).

Getting At The Value ('de-referencing')

@{$array_reference}
%{$hash_reference}
${$scalar_reference}

print @{$ref_to_y} yields 1a23. After

$y[3] = 'z';
print @{$ref_to_y}
yields 1a23z. Why?
                     +--------------------+
             +-> @y: | (1, 'a',  23, 'z') |
             |       +--------------------+
             |
             |
           +-|-+
$ref_to_y: | * |
           +---+
After
@y = (5, 6, 7);
we have
                     +----------+
             +-> @y: | (5, 6, 7)|
             |       +----------+
             |
             |
           +-|-+
$ref_to_y: | * |
           +---+
print @{$ref_to_y}
yields 567.

After

$ref_to_y2 = $ref_to_y;
we have
            +---+
$ref_to_y2: | * |
            +-|-+
              |
              |       +-----------+
              +-> @y: | (5, 6, 7) |
              +->     +-----------+
              |
              |
            +-|-+
$ref_to_y:  | * |
            +---+

print @{$ref_to_y}
and
print @{$ref_to_y2}
both yield 567. After
@z = @{$ref_to_y};
$ref_to_y->[0] = '2';
$ref_to_y2->[2] = '24';
we have
            +---+
$ref_to_y2: | * |
            +-|-+
              |
              |       +------------+
              +-> @y: | (2, 6, 24) |
              +->     +------------+
              |
              |
            +-|-+
$ref_to_y:  | * |
            +---+

      +-----------+
@z:   | (5, 6, 7) |
      +-----------+
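
The last diagram can be checked with a few lines of Perl. This is only a sketch of the steps above, written with my so that it runs under use strict:

```perl
use strict;
use warnings;

my @y         = (5, 6, 7);
my $ref_to_y  = \@y;           # points at @y itself
my $ref_to_y2 = $ref_to_y;     # a second pointer to the same array
my @z         = @{$ref_to_y};  # a separate copy of the current contents

$ref_to_y->[0]  = 2;           # both of these change @y ...
$ref_to_y2->[2] = 24;          # ... because both references point to it

print "y: @y\n";   # prints: y: 2 6 24
print "z: @z\n";   # prints: z: 5 6 7

# Two references compare equal (numerically) when they point to the same thing.
print "same array\n" if $ref_to_y == $ref_to_y2;
```

Copying a reference copies the pointer, not the data; only dereferencing with @{...} and assigning to a new array makes an independent copy.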

Some (of many) ways to get at reference data

There are several different ways to 'dereference'.
Given a reference called $array_ref (to our favorite array):
  • my @new_array = @{$array_ref}; # get entire array and assign to new variable
  • my $list_element = ${$array_ref}[1]; # get the element at index 1 (the second element) of the array referenced by $array_ref
  • my $list_element = $array_ref->[1]; # same thing with 'arrow syntax'

Given a reference called $hash_ref (to our favorite hash):

  • my %hash_copy = %{$hash_ref}; # get a new copy of the hash referenced by $hash_ref
  • my $hash_value = ${$hash_ref}{'some_key'}; # assign the value associated with 'some_key' in the hash referenced by $hash_ref
  • my $hash_value = $hash_ref->{'some_key'}; # same thing!!

Given a reference called $my_cool_subroutine (to our favorite subroutine):

  • my $result = &{$my_cool_subroutine}($arg1,$arg2); # invoke the referenced subroutine with two arguments
  • my $result = $my_cool_subroutine->($arg1,$arg2); # look familiar??
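
One place where subroutine references earn their keep is a "dispatch table": a hash whose values are references to subroutines. The sketch below is illustrative only; gc_count, at_count and %counter_for are names invented for this example.

```perl
use strict;
use warnings;

# Two small counting subroutines (hypothetical helpers for this example).
sub gc_count { my ($seq) = @_; my $n = ($seq =~ tr/gcGC//); return $n; }
sub at_count { my ($seq) = @_; my $n = ($seq =~ tr/atAT//); return $n; }

# A hash of subroutine references, keyed by a name we can choose at run time.
my %counter_for = (
    gc => \&gc_count,
    at => \&at_count,
);

my $seq = 'gattcgattcc';
for my $name (sort keys %counter_for) {
    my $count = $counter_for{$name}->($seq);   # arrow syntax calls the referenced sub
    print "$name: $count\n";
}
# prints:
# at: 6
# gc: 5
```

The older form &{$counter_for{$name}}($seq) would do the same thing; the arrow form is usually easier to read.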

Making References To Arbitrary Values From Scratch (Anonymous Hashes or Arrays)

$y_gene_families = ['DAZ', 'TSPY', 'RBMY', 'CDY1', 'CDY2' ];

$y_gene_family_counts = { 'DAZ'  => 4,
                          'TSPY' => 20,
                          'RBMY' => 10,
                          'CDY2' => 2 };

$y_gene_families gets a reference to an array, and $y_gene_family_counts gets a reference to a hash. (See the book for subroutines and scalars.) For example

for (keys %{$y_gene_family_counts}) { print "$_\n" }
my @a = @{$y_gene_families};
${$y_gene_families}[0];          # yields 'DAZ'
${$y_gene_family_counts}{'DAZ'}; # yields '4'
Arrow shorthand:
$y_gene_families->[0];           # yields 'DAZ'
$y_gene_family_counts->{'DAZ'};  # yields '4'

New Function: ref

ref - What Kind Of Value Does This Reference Point To?

print ref($y_gene_families), "\n";
ARRAY 

print ref($y_gene_family_counts), "\n";
HASH

$x = 1; print ref($x), "\n";

(empty string)
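
A typical use of ref is to decide how to treat a value whose type you don't know in advance. The subroutine describe below is a hypothetical helper written for this note, not a built-in:

```perl
use strict;
use warnings;

# describe() reports what kind of value it was handed, using ref() to decide.
sub describe {
    my ($value) = @_;
    my $type = ref($value);    # '' for non-references, otherwise ARRAY, HASH, CODE, ...
    return 'plain scalar'                                  if $type eq '';
    return 'array of ' . scalar(@{$value}) . ' elements'   if $type eq 'ARRAY';
    return 'hash with ' . scalar(keys %{$value}) . ' keys' if $type eq 'HASH';
    return 'subroutine'                                    if $type eq 'CODE';
    return "something else ($type)";
}

print describe(1), "\n";                  # plain scalar
print describe(['DAZ', 'TSPY']), "\n";    # array of 2 elements
print describe({ DAZ => 4 }), "\n";       # hash with 1 keys
print describe(sub { 1 }), "\n";          # subroutine
print describe(\'text'), "\n";            # something else (SCALAR)
```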


Scripting Example: Hash of Hashes

#!/usr/bin/perl

use strict;

@ARGV = '/net/share/perl_refs/cosmids1.txt' unless @ARGV;

$/ = ">";
my %DATA;
while (<>) {
  chomp;
  my ($id_line,@rest) = split "\n";
  $id_line =~ /^(\S+)/ or next;
  my $id = $1;

  my $sequence = join '',@rest;
  my $length   = length $sequence;
  my $gc_count = $sequence =~ tr/gcGC/gcGC/;
  my $gc_content = $length ? $gc_count/$length : 0;  # avoid dividing by zero on an empty record

  $DATA{$id} = { sequence   => $sequence,
                 length     => $length,
                 gc_content => sprintf("%3.2f",$gc_content)
               };
}

my @ids = sort {  $DATA{$a}->{gc_content} <=> $DATA{$b}->{gc_content}
               } keys %DATA;

foreach my $id (@ids) {
  print "$id\n";
  print "\tgc content = $DATA{$id}->{gc_content}\n";
  print "\tlength     = $DATA{$id}->{length}\n";
  print "\n";
}

Looking At Complex Data Structures In The Debugger

Use the x command.
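
In the debugger, give x a reference (for example x \%DATA or x $DATA{$id}) so the structure is printed as one nested unit. Outside the debugger, the standard Data::Dumper module gives a similar view. The sketch below uses a small made-up record (the id seqA is hypothetical) shaped like the entries in the Hash of Hashes script:

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # print hash keys in sorted order
$Data::Dumper::Indent   = 1;   # simple fixed-width indentation

# One made-up record in the same shape as the %DATA entries above.
my %DATA = (
    'seqA' => { sequence => 'gattc', length => 5, gc_content => '0.40' },
);

# Pass a reference so the whole hash is dumped as a single structure.
print Dumper(\%DATA);
```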


Perl Object Syntax

Perl objects are special references that come bundled with a set of functions that know how to act on the contents of the reference. For example, in BioPerl, there is a Sequence object. Internally, the Sequence object is a hash reference that has keys that point to the DNA string, the name and source of the sequence, and other attributes. The object is bundled with functions that know how to manipulate the sequence, such as revcom(), translate(), subseq(), etc.

When talking about objects, the bundled functions are known as methods. This terminology derives from Smalltalk, one of the earliest and most influential object-oriented languages. You invoke a method using the -> operator, a syntax that looks a lot like getting at the value that a reference points to.

For example, if we have a Sequence object stored in the scalar variable $sequence, we can call its methods like this:
$reverse_complement = $sequence->revcom();
$first_10_bases     = $sequence->subseq(1,10);
$protein            = $sequence->translate;

We assume that you've created a Sequence object and stored it into $sequence at some previous point. We'll see how to do this later.

First we call the Sequence object's revcom() method, which creates the reverse complement of the sequence and stores it into the scalar variable $reverse_complement. Then we call subseq(1,10) to return the subsequence spanning bases one through ten. Finally we call the object's translate() method to turn the DNA into a protein. You will learn from the BioPerl lecture that revcom() and translate() return new Sequence objects that themselves know how to revcom(), translate() and so forth (subseq(), by contrast, returns the bases as a plain string). So if you wanted to get the protein translation from the reverse complement, you could do this:

$reverse_complement = $sequence->revcom();
$protein            = $reverse_complement->translate;

Don't be put off by this syntax! $sequence is really just a hash reference, and you can get its keys using keys %$sequence, peek at the contents of the "_seq_length" key using $sequence->{_seq_length}, and so forth. Indeed, the syntax $sequence->translate is just a fancy way of writing translate($sequence), except that the object knows what module the translate() function is defined in.
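
To make the "object = reference + bundled functions" idea concrete, here is a toy class. Toy::Seq is invented for this note and is not part of BioPerl; the real Bio::PrimarySeq is far more elaborate, and the key names used here are only assumptions for illustration.

```perl
use strict;
use warnings;

# Toy::Seq is a made-up class for illustration; it is not part of BioPerl.
package Toy::Seq;

sub new {
    my ($class, $dna) = @_;
    my $self = { _seq => $dna, _seq_length => length($dna) };
    return bless $self, $class;   # bless ties the hash reference to this package
}

sub seq { my ($self) = @_; return $self->{_seq}; }

sub revcom {
    my ($self) = @_;
    my $rc = reverse $self->{_seq};      # scalar context: reverses the string
    $rc =~ tr/acgtACGT/tgcaTGCA/;        # complement each base
    return Toy::Seq->new($rc);           # a new object, so calls can be chained
}

package main;

my $sequence = Toy::Seq->new('gattc');
print ref($sequence), "\n";                    # Toy::Seq
print join(',', sort keys %$sequence), "\n";   # _seq,_seq_length
print $sequence->revcom->seq, "\n";            # gaatc
print Toy::Seq::seq($sequence), "\n";          # gattc, same as $sequence->seq
```

Note that ref returns the class name for a blessed reference, which is how Perl knows where to look for the method.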

Using Objects

Before you can start using objects, you must load their definitions from the appropriate module(s). This is just like loading subroutines from modules, and you use the use statement in both cases. For example, if we want to load the BioPerl Sequence definitions, we load the appropriate module, which in this case is called Bio::PrimarySeq (you learn this from reading the BioPerl documentation):

#!/usr/bin/perl -w
use strict;
use Bio::PrimarySeq;

Now you'll probably want to create a new object. There are a variety of ways to do this, and details vary from module to module, but most modules, including Bio::PrimarySeq, do it using the new() method:

#!/usr/bin/perl -w
use strict;
use Bio::PrimarySeq;
my $sequence = Bio::PrimarySeq->new('gattcgattccaaggttccaaa');

The syntax here is ModuleName->new(@args), where ModuleName is the name of the module that contains the object definitions. The new() method will return an object that belongs to the ModuleName class. So in the example above, we get a Bio::PrimarySeq object, which is the simplest of BioPerl's various Sequence object types.

An alternative way to call new() puts the method name in front of the module name:

#!/usr/bin/perl -w
use strict;
use Bio::PrimarySeq;
my $sequence = new Bio::PrimarySeq('gattcgattccaaggttccaaa');

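This does the same thing as Bio::PrimarySeq->new(), and looks more natural to Java programmers. The perlobj man page nevertheless discourages this "indirect object" form, because Perl can misparse it in some situations, so prefer the arrow form in new code.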

Passing Arguments to Methods

When you call object methods, you can pass a list of arguments, just as you would to a regular function. We saw an example of this earlier when we called $sequence->subseq(1,10). As methods get more complex, argument lists can get quite long and have possibly dozens of optional arguments. To make this manageable, many object-oriented modules use a named parameter style of argument passing, that looks like this:

my $result = $object->method(-arg1=>$value1,-arg2=>$value2,-arg3=>$value3);

In this case "-arg1", "-arg2", and so on are the names of arguments, and $value1, $value2 are the values of those named arguments. The name/value pairs can occur in any order.

As a practical example, Bio::PrimarySeq->new() actually takes multiple optional arguments that allow you to specify the alphabet, the source of the sequence, and so forth. Rather than create a humungous argument list which forces you to remember the correct position of each argument, Bio::PrimarySeq lets you create a new Sequence this way:

#!/usr/bin/perl -w
use strict;
use Bio::PrimarySeq;
my $sequence = Bio::PrimarySeq->new(-seq              => 'gattcgattccaaggttccaaa',
                                    -id               => 'oligo23',
                                    -alphabet         => 'dna',
                                    -is_circular      => 0,
                                    -accession_number => 'X123'
                                   );

Notice that we've broken the argument list across multiple lines. This makes it easier to read, but means nothing special to Perl.
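
If you are curious how a subroutine can accept named arguments, one common approach is to copy the argument list into a hash. The sketch below uses make_oligo, a hypothetical function written for this note; it is not how Bio::PrimarySeq itself is implemented.

```perl
use strict;
use warnings;

# make_oligo() is a hypothetical function, not a BioPerl call.
sub make_oligo {
    my %args = (
        -alphabet    => 'dna',   # defaults come first ...
        -is_circular => 0,
        @_,                      # ... and are overridden by whatever the caller passes
    );
    die "-seq is required\n" unless defined $args{'-seq'};
    return \%args;               # hand back a reference to the filled-in hash
}

# The name/value pairs can be given in any order.
my $oligo = make_oligo(-id => 'oligo23', -seq => 'gattcgattccaaggttccaaa');
print $oligo->{'-id'}, ' ', $oligo->{'-alphabet'}, ' ', length($oligo->{'-seq'}), "\n";
# prints: oligo23 dna 22
```

Because later pairs in a hash assignment replace earlier ones with the same key, listing the defaults before @_ gives optional arguments for free.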


Workshop Problem Set

  1. In the debugger or a program try the examples in the section What Is A Reference above (@y, $ref_to_y, etc.)

  2. Get the Hash of Hashes example right here. Run it in the debugger, and set a breakpoint at line 37 (debugger command b 37). Then run to that breakpoint (debugger command c). At line 37 use x to print out the value of $e. What is the value of ref($DATA{$id})?

  3. (To do this problem steal and modify your own code from a previous problem set.) Write a subroutine unwrap that takes as its argument the name of a file, and that returns a reference to a HASH that maps sequence identifiers to sequences, e.g.
      $x = unwrap('/net/share/perl_refs/cosmids1.txt');
      print $x->{'ZK1307.9'}, "\n";
      print $x->{'ZK1248.6'}, "\n";
      print $x->{'ZK1236.5'}, "\n";
      

    produces

      atgggagagcgtaaaggacaa...
      atggcccaatccgtcccaccg...
      tcagtcccatcgttttcttgc...
      

    Hint (sketch only, not working code):

      sub unwrap {
        # ... Get argument, $filepath
        my $result = {};
        # ...
        while (...) {
           # ... Get $sequence_id and $sequence for each entry;
           $result->{$sequence_id} = $sequence;
        }
        $result;
      }
      

    Sample solution here.


    The remaining problems are for extra practice. If you got this far you know the basic material from this lecture.

  4. (To do this problem steal and modify your own code from a previous problem set.) Write another subroutine, threeframes, that takes as input the data structure returned by the subroutine unwrap, and returns a HASH reference that maps sequence id's to 4-element HASH reference "records". Each record contains the keys: sequence, frame1, frame2, frame3. For example (from cosmids.fasta):
      { 'ZK1307.9' => { 'sequence' => 'atgggagagcgtaaaggacaa...',
                     'frame1'   => 'atg gga gag cgt aaa gga...',
                     'frame2'   => 'tgg gag agc gta aag gac...',
                     'frame3'   => 'ggg aga gcg taa agg aca...' },
       
        'ZK1248.6' => { 'sequence' => 'atggcccaatccgtcccaccg...', 
                       ...
                     },
         ...
      }
      

    Running

      $x = threeframes(unwrap('/net/share/perl_refs/cosmids1.fasta'));
      print $x->{'ZK1307.9'}->{'frame1'}, "\n";
      print $x->{'ZK1248.6'}->{'frame1'}, "\n";
      print $x->{'ZK1248.6'}->{'frame3'}, "\n";
      

    should produce

      atg gga gag cgt aaa gga...
      atg gcc caa tcc gtc cca...
      ggc cca atc cgt ccc acc...
      

    Hint (sketch only, not working code):

    sub threeframes {
      # Get argument $inhash
      my $result = {};
      for (keys %{$inhash}) {
        my $seq = $inhash->{$_};
        $result->{$_}->{'sequence'} = $seq;
        # Do frame 1
        my @frames = # .. you know how to do this ..
        $result->{$_}->{'frame1'} = join(' ', @frames);
        # Do frame 2 
        $seq = substr($seq, 1);
        @frames = # .. you know how to do this ..
        $result->{$_}->{'frame2'} = join(' ', @frames);
        # Do frame 3
        ...
      }
      $result;
    }
      

    Sample solution here.

  5. Modify sub threeframes so that 'frame1' points to an ARRAY ref, each element of which is one codon.