Thursday, January 31, 2008

Integrative Bioinformatics 7

[Index of classes ]

January 31 - Review of homework assignments - dynamic programming for sequence alignment and regular expressions:

  1. Dynamic programming of the Needleman-Wunsch alignment algorithm.
  2. Review of solutions to other alignment algorithms; note the storing of matrix indexes for traceback procedures.
  3. Visualization of scoring dynamic programming matrices.
  4. Regular expression for tag delimitation in tabular formats.
  5. Regular expression challenge.
  6. Recommended new reading - upcoming challenges/opportunities in sequence analysis:

    A Plan to Capture Human Diversity in 1000 Genomes [Science 25 January 2008]
    Scientists Synthesize a Genome From Scratch [ScienceNOW Daily News, 24 January 2008]
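
The dynamic programming and traceback steps in items 1 and 2 can be sketched as follows. This is only a minimal illustration, not the class solution: the two sequences and the scoring values (+1 match, -1 mismatch, -1 gap) are assumptions chosen for the example. The pointer matrix P stores, for every cell, which neighbor produced its score, so the traceback only has to follow the stored indexes.

```matlab
% Needleman-Wunsch global alignment with stored traceback pointers.
% Sequences and scoring values are illustrative assumptions.
s1 = 'GATTACA';
s2 = 'GCATGCU';
matchScore = 1; mismatchScore = -1; gapScore = -1;

n = length(s1); m = length(s2);
F = zeros(n+1, m+1);                 % scoring matrix
P = zeros(n+1, m+1);                 % pointers: 1 = diagonal, 2 = up, 3 = left
F(:,1) = (0:n)' * gapScore;  P(2:end,1) = 2;
F(1,:) = (0:m)  * gapScore;  P(1,2:end) = 3;

% Fill the matrix; each cell keeps the best of three moves and remembers which one.
for i = 2:n+1
    for j = 2:m+1
        if s1(i-1) == s2(j-1), d = matchScore; else, d = mismatchScore; end
        candidates = [F(i-1,j-1) + d, F(i-1,j) + gapScore, F(i,j-1) + gapScore];
        [F(i,j), P(i,j)] = max(candidates);   % on ties, max picks the first: diagonal, then up
    end
end

% Traceback: follow the stored pointers from the bottom-right corner to the origin.
a1 = ''; a2 = '';
i = n+1; j = m+1;
while i > 1 || j > 1
    switch P(i,j)
        case 1
            a1 = [s1(i-1) a1]; a2 = [s2(j-1) a2]; i = i-1; j = j-1;
        case 2
            a1 = [s1(i-1) a1]; a2 = ['-' a2]; i = i-1;
        case 3
            a1 = ['-' a1]; a2 = [s2(j-1) a2]; j = j-1;
    end
end
score = F(n+1, m+1);
disp(a1); disp(a2);
```

For the visualization in item 3, calling imagesc(F) on the filled matrix is one simple option.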

HOMEWORK
(note: old solutions are at http://ibl.mdanderson.org/~jalmeida/IB2008/)

Write your favorite implementation of a UCSC client in MATLAB and include with your m-files an archived HTML report describing its use, produced with cell-mode programming + publish.

Tuesday, January 29, 2008

Integrative Bioinformatics 6

January 29 - hands-on session developing a client for UCSC Genome Browser (everybody):

  1. BioDAS (Diogo)
  2. Review of interoperable solutions.
  3. Introduction to the Document Object Model (DOM) and how XML can be mapped through it into a regular data structure. See for example MATLAB's XMLREAD command and compare it with this toolbox.
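
As a rough sketch of item 3, the snippet below walks a DOM returned by XMLREAD and copies it into a MATLAB struct array. The XML is a made-up sample patterned on a DAS data-source listing and is written to a temporary file so the example runs offline; the field names id, label, and description are assumptions made for this illustration.

```matlab
% Made-up sample document, patterned on a DAS data-source listing (assumption).
xmlText = ['<DASDSN>' ...
    '<DSN><SOURCE id="hg18">Mar. 2006 at UCSC</SOURCE>' ...
    '<DESCRIPTION>Human Mar. 2006 Genome at UCSC</DESCRIPTION></DSN>' ...
    '<DSN><SOURCE id="hg16">July 2003 at UCSC</SOURCE>' ...
    '<DESCRIPTION>Human July 2003 Genome at UCSC</DESCRIPTION></DSN>' ...
    '</DASDSN>'];

% Write it to a temporary file so xmlread can parse it without network access.
tmpFile = [tempname '.xml'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '%s', xmlText);
fclose(fid);
dom = xmlread(tmpFile);
delete(tmpFile);

% Map each DSN element to one entry of a struct array.
dsnList = dom.getElementsByTagName('DSN');
genomes = struct('id', {}, 'label', {}, 'description', {});
for k = 0:dsnList.getLength-1                 % DOM node lists are zero-based
    dsn = dsnList.item(k);
    src = dsn.getElementsByTagName('SOURCE').item(0);
    des = dsn.getElementsByTagName('DESCRIPTION').item(0);
    genomes(k+1).id          = char(src.getAttribute('id'));   % Java strings -> char
    genomes(k+1).label       = char(src.getTextContent);
    genomes(k+1).description = char(des.getTextContent);
end
disp({genomes.id});
```

The DOM objects are Java objects, which is why indexing starts at zero and why char() is needed to convert the returned strings.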

Integrative genomics using DAS: http://www.biomedcentral.com/1471-2105/8/333

HOMEWORK

Don't forget the pending assignments: write a MATLAB function that reads an HTML table into a cell array, and also the alignment homework. These are due Thursday.



BioDAS queries against the UCSC Genome Database

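%% Data sources command

% Get all genomes (data sources) available. The XML looks like this:
%
% <DASDSN>
%   <DSN>
%     <SOURCE id="hg18" version="1.00">Mar. 2006 at UCSC</SOURCE>
%     <MAPMASTER>http://genome.cse.ucsc.edu:80/cgi-bin/das/hg18</MAPMASTER>
%     <DESCRIPTION>Human Mar. 2006 Genome at UCSC</DESCRIPTION>
%   </DSN>
%   ...
% </DASDSN>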

% This code manipulates the response using the DOM API.
% Tag names are case-sensitive, so use them exactly as they appear in the XML.
genomes_Dom = xmlread('http://genome.ucsc.edu/cgi-bin/das/dsn');
dsnL = genomes_Dom.getElementsByTagName('DSN');
sourceL = genomes_Dom.getElementsByTagName('SOURCE');
descrL = genomes_Dom.getElementsByTagName('DESCRIPTION');

% This code manipulates the response using the XML Toolbox from Geodise (download it first).
xml = urlread('http://genome.ucsc.edu/cgi-bin/das/dsn');
t_parseany = xml_parseany(xml);

% When you reach a leaf, read its attributes and content like this:
t_parseany.DSN{1}.SOURCE{1}.ATTRIBUTE(1)
t_parseany.DSN{1}.SOURCE{1}.CONTENT

%% entry_points command

% Get all chromosomes (entry_points command). Here, hg16 refers to a specific genome assembly (Human Genome July 2003), as returned by the data sources command.
% e.g.
urlread('http://genome.ucsc.edu/cgi-bin/das/hg16/entry_points')

%% types command

% Get annotation types for a sequence segment. Annotation types are all the features annotated for the sequence, such as knownGenes, SNPs, mRNAs, ESTs, exons, introns and so on.
% This command gives us an overview of the annotation for the sequence, including the number of each feature.
urlread('http://genome.ucsc.edu/cgi-bin/das/hg16/types?segment=4:3000000,4000000');

%% features command with type parameters

% Get one or more specific features for a sequence. Several features and/or segments can be queried at the same time.
% e.g. Retrieve all SNPs and known genes for the segment.
urlread('http://genome.ucsc.edu/cgi-bin/das/hg16/features?segment=4:3000000,4000000;type=snp;type=knownGene');

%% dna command

%Get raw nucleotide sequence data.
urlread('http://genome.ucsc.edu/cgi-bin/das/hg16/dna?segment=chr4:30000,300100');
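
The query URLs above all follow the same pattern: base address, genome assembly, command, then a query string. A client can assemble them from parts rather than hard-coding each one. The following is a small sketch of that idea; the variable names are assumptions made for illustration, and the network call is left commented out so the snippet runs offline.

```matlab
% Assemble DAS request URLs from their parts: <base>/<genome>/<command>?<query>
baseUrl  = 'http://genome.ucsc.edu/cgi-bin/das';
genome   = 'hg16';
chrom    = '4';
startPos = 3000000;
stopPos  = 4000000;
types    = {'snp', 'knownGene'};

segment   = sprintf('segment=%s:%d,%d', chrom, startPos, stopPos);
typeQuery = sprintf(';type=%s', types{:});   % format is reused once per list entry
featuresUrl = sprintf('%s/%s/features?%s%s', baseUrl, genome, segment, typeQuery);
dnaUrl      = sprintf('%s/%s/dna?%s', baseUrl, genome, segment);
disp(featuresUrl);
disp(dnaUrl);
% xml = urlread(featuresUrl);   % uncomment to query the server (needs network access)
```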







List of algorithms for copy number aberration

1. Assessing the Significance of Conserved Genomic Aberrations Using High Resolution Genomic Microarrays

http://genetics.plosjournals.org/perlserv/?request=get-document&doi=10.1371%2Fjournal.pgen.0030143

2. Computation of recurrent minimal genomic alterations from array-CGH data

http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/7/849

3. STAC: A method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments

http://www.genome.org/cgi/content/abstract/gr.5076506v1

4. SIRAC: Supervised Identification of Regions of Aberration in aCGH datasets

http://www.biomedcentral.com/content/pdf/1471-2105-8-422.pdf

5. A Robust Algorithm for Copy Number Detection Using High-Density Oligonucleotide Single Nucleotide Polymorphism Genotyping Arrays

http://cancerres.aacrjournals.org/cgi/content/abstract/65/14/6071

6. Modeling recurrent DNA copy number alterations in array CGH data

http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/13/i450


7. Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis

http://www.liebertonline.com/doi/abs/10.1089/cmb.2006.13.215?journalCode=cmb


8. Assessing the significance of chromosomal aberrations in cancer: Methodology and application to glioma

http://www.pnas.org/cgi/content/full/104/50/20007




Data analysis and bioinformatics tools for tandem mass spectrometry in proteomics

PRIDE: the proteomics identifications database.

Monday, January 28, 2008

Flexible Informatics for Linking Experimental Data to Mathematical Models Via DataRail

Link to Bioinformatics Paper

ABSTRACT
Motivation: Linking experimental data to mathematical models in
biology is impeded by the lack of suitable software to manage and
transform data. Model calibration would be facilitated and models
would increase in value were it possible to preserve links to training
data along with a record of all normalization, scaling, and fusion
routines used to assemble the training data from primary results.
Results: We describe the implementation of DataRail, an open
source MATLAB-based toolbox that stores experimental data in
flexible multi-dimensional arrays, transforms arrays so as to maximize
information content, and then constructs models using internal
or external tools. Data integrity is maintained via a containment hierarchy
for arrays, imposition of a metadata standard based on a
newly proposed MIDAS format, assignment of semantically typed
universal identifiers, and implementation of a procedure for storing
the history of all transformations with the array. We illustrate the
utility of DataRail by processing a newly collected set of ~22,000
measurements of protein activities obtained from cytokine-stimulated
primary and transformed human liver cells.
Availability: DataRail is distributed under the GNU General Public
License and available at http://code.google.com/p/sbpipeline/
Contact: sbpipeline@hms.harvard.edu
Supplementary information: accompanies this paper.

Thursday, January 24, 2008

Integrative Bioinformatics 5

January 24 - hands-on session developing a client for UCSC Genome Browser (everybody):

  1. Quick introduction to REST vs SOAP + revisit of precedents: CORBA, email, port channeling, etc.
  2. Quick introduction to the upcoming database and computational statistics modules.
  3. Hands-on session.

HOMEWORK

Don't forget last session's: write a MATLAB function that reads an HTML table into a cell array.

Tuesday, January 22, 2008

Integrative Bioinformatics 4

January 22 - Data structures and data services (Pablo, Jonas):

  1. Extracting data structures from data services.
  2. Attribute/Value pairs --> XML --> RDF triples
  3. Regular Expressions.
  4. UCSC Genome Browser as a data service providing an aggregating data structure. For online tutorials see this page.
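
To illustrate how regular expressions can extract a data structure from a tagged document, here is a minimal sketch that pulls the cells of an HTML table into a cell array. The HTML string is a made-up sample, and the patterns assume a simple table with no nested tables; this is one possible approach, not a general HTML parser.

```matlab
% Made-up sample table (assumption); a real page would come from urlread.
html = ['<table>' ...
    '<tr><th>chrom</th><th>start</th><th>end</th></tr>' ...
    '<tr><td>chr4</td><td>3000000</td><td>4000000</td></tr>' ...
    '<tr><td>chr7</td><td>100</td><td>200</td></tr>' ...
    '</table>'];

% The lazy quantifier (.*?) stops at the first closing tag;
% the 'tokens' option returns only the text captured by the parentheses.
rows = regexpi(html, '<tr[^>]*>(.*?)</tr>', 'tokens');
tbl = {};
for r = 1:numel(rows)
    cells = regexpi(rows{r}{1}, '<t[dh][^>]*>(.*?)</t[dh]>', 'tokens');
    for c = 1:numel(cells)
        tbl{r, c} = strtrim(cells{c}{1});   % one table cell per cell-array element
    end
end
disp(tbl);
```

All values come back as strings; numeric columns can be converted afterwards with str2double.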

HOMEWORK

Write a MATLAB function that reads an HTML table into a cell array.

Thursday, January 17, 2008

Integrative Bioinformatics 3

1. Discussion of matrix notation using the homework assignment. [my solution].

2. Alignment as a similarity metric. [Presentation].

3. Discussion of collective assignment on developing a client that will use UCSC Genome Browser as a data service.

HOMEWORK

Since you did so well in the introductory class, today we move to an advanced algorithm deployment assignment. The homework is described in the last slide of the presentation.

Tuesday, January 15, 2008

Integrated Bioinformatics 2008 2

This class will introduce the two main components of the integrative exercise: data structures and programming languages. The exploration of these two topics will be pursued in MATLAB, a fast prototyping scientific and engineering programming environment.

In addition to the very extensive help material that comes with MATLAB (from manuals to videos; click on "Help" in the top menu to find more), MathWorks' website also includes a great selection of webinars.

HOMEWORK

Today we have a small homework assignment just to make sure we all know how to send them to me: write an m-function that identifies the largest element of a matrix and returns its location.
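
One possible approach, shown here as a short script rather than a finished m-function: take the maximum over all elements, then use find to recover the row and column of every occurrence. The matrix A is an arbitrary example, and the variable names are assumptions.

```matlab
A = [3 9 2; 9 1 4];                        % arbitrary example matrix
maxValue = max(A(:));                      % A(:) flattens the matrix to one column
[rowIdx, colIdx] = find(A == maxValue);    % one row/column pair per occurrence
disp([rowIdx colIdx]);
```

Using find instead of the second output of max reports all locations when the largest value occurs more than once.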