Tuesday, March 27, 2007

How much does it cost to build an ontology? Answer: 3 t-shirts, 4 coffee mugs, and one chocolate moose


Link to Paper

During two days at a conference focused on circulatory and respiratory health, 68 volunteers untrained in knowledge engineering participated in an experimental knowledge capture exercise. These volunteers created a shared vocabulary of 661 terms, linking these terms to each other and to a pre-existing upper ontology by adding 245 hyponym relationships and 340 synonym relationships. While ontology-building has proved to be an expensive and labor-intensive process using most existing methodologies, the rudimentary ontology constructed in this study was composed in only two days at a cost of only 3 t-shirts, 4 coffee mugs, and one chocolate moose. The protocol used to create and evaluate this ontology involved a targeted, web-based interface. The design and implementation of this protocol is discussed along with quantitative and qualitative assessments of the constructed ontology.
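To make the two relationship types in the abstract concrete, here is a minimal sketch of a vocabulary that stores hyponym ("is a kind of") and synonym links. The class name, method names and example terms are assumptions made for illustration; they are not taken from the study or its web interface.

```python
from collections import defaultdict


class Vocabulary:
    """Toy term store with hyponym and synonym links (illustrative only)."""

    def __init__(self):
        self.parents = defaultdict(set)   # narrower term -> broader terms
        self.synonyms = defaultdict(set)  # term -> equivalent terms

    def add_hyponym(self, narrower, broader):
        # Record that `narrower` is a kind of `broader`.
        self.parents[narrower].add(broader)

    def add_synonym(self, a, b):
        # Synonymy is symmetric, so store it in both directions.
        self.synonyms[a].add(b)
        self.synonyms[b].add(a)

    def ancestors(self, term):
        # Walk the hyponym links upward and collect every broader term.
        seen = set()
        stack = [term]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical terms, not drawn from the 661 collected in the study.
vocab = Vocabulary()
vocab.add_hyponym("asthma", "respiratory disease")
vocab.add_hyponym("respiratory disease", "disease")
vocab.add_synonym("heart attack", "myocardial infarction")
print(sorted(vocab.ancestors("asthma")))
```

Linking a new term to an existing upper ontology amounts to adding one hyponym edge whose broader term already exists in the store.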

Thursday, March 22, 2007

Integrated Bioinformatics 1

[Index of classes]

This first class will outline the three main sections covered in this course:

a) Programming environments - emphasis on the minimum toolkit made of Matlab, PHP, Javascript.
b) Data structures - emphasis on using mat, XML, RDF.
c) Integrated data management and analysis tools - S3DB, Bioinformatics Station.

This session also seeks to establish what programming tools, knowledge and computing resources the participants possess. The teaching material will be made available online with each session. Each class ends with a homework assignment, which is also the first topic in the next session.

The structure of this course will be constructive before being instructive. The first few sessions aim at making the participants familiar with, and active in, the collaborative data management and analysis tools being developed and integrated at MDACC. Therefore, the first aim of this course is to enable those who produce data and those who analyze it to interoperate. Once that is established, this constructive aim will be succeeded by an instructive emphasis on algorithm identification and deployment. This hands-on computational statistics component will be complemented with some elements of emerging Integrative Bioinformatics theory.


Those who want Katherine's data, please email me your email address!

Integrative Bioinformatics

Integrative Bioinformatics (IB) has emerged as the label of choice to describe the development of integrated data management and data analysis infrastructure for the life sciences. The elaboration of the technology required, the sophistication of the computational statistics methodologies involved, and the potential for formal, abstract, systemic representation of very complex Biological phenomena have led to the coalescence of Integrative Bioinformatics as a research topic of its own. Accordingly, a graduate research program was put in place to train newcomers in this field.

The graduate training in integrative bioinformatics includes a formal program covering founding elements of Biology, Computer Science and Statistics. However, the defining characteristic of the training program, and indeed of the research practice in this field, is its integrative focus. Accordingly, this material is delivered in a problem-solving format geared towards the identification and deployment of algorithmic solutions that interoperate with the global, mostly public, suite of bioinformatic resources. This choice of format also anticipates the graduate research itself, which includes bioinformatic tool making.

Active courses:

Integrative Bioinformatics 2008 [GS01 0123]

Location HMB 13.356, Tuesdays and Thursdays 11am-1pm


  1. [Jan 8] - Introduction (Jonas)
  2. [Jan 15] - Introduction to programming and data structures in MATLAB. (Jonas)
  3. [Jan 17] - Algorithm deployment illustrated for alignment as a metric. (Jonas)
  4. [Jan 22] - Data structures and data services (Pablo, Jonas)
  5. [Jan 24] - Hands-on session developing a client for UCSC Genome Browser (everybody)
  6. [Jan 29] - Document object model (DOM) and XML as vehicles for interoperability.
  7. [Jan 31] - Dynamic programming and regular expression homework.
  8. [Feb 05] - Design and practice of Graphic User Interface development.
  9. [Feb 07] - more on GUIs. Discussion of the DILS 2008 challenge.
  10. [Feb 12] - S3DB: a distributed, semantically explicit, RESTful, DBMS.
  11. [Feb 14] - Continuation of last class: S3DB.
  12. [Feb 19] - Homework review of UCSC client (Diogo > Lena > Chunyan > Rys > David)
  13. [Feb 21] - Common Standards vs Common Protocols [Romesh Stanislaus presents]
  14. [Feb 26] - GUI example 1/4 : TCGA client.

Past courses:

Integrative Bioinformatics 2007

Location HMB 13.356, Thursdays 2-4 pm

  1. [March 29] - Introduction (Jonas)
  2. [April 5] - Introduction to Data Structures (Jonas)
  3. [April 12] - Semantic constructs using S3DB (Lena)
  4. [April 19] - XML constructs (Romesh & Lena)
  5. [April 27] - Case Study: Ovary Cancer data integration at the Kleberg Foundation (Jonas)
  6. [May 3] - Menial Bioinformatics - modelling strings for parsing data files (Jonas).
  7. [May 10] - XSLT, XPath, XQuery, XML I/O.
  8. [May 16] - NO CLASS TODAY, we are participating in the SOA meeting at MDACC. If you want to attend please email me (Jonas) and I'll send you the directions to the seminar room.
  9. [May 24] - Last session: overview of multivariate exploratory and discriminant statistical analysis methods. Here's the goodbye and thank-you note; in 2008 we hope to have IB as a formal topic of GSBS:

-------- Original Message --------
Subject: Classes are over
Date: Thu, 31 May 2007 09:59:02 -0500
From: Jonas S Almeida
To: Helena Deus , Jonas Almeida , jiexinzhang@mdacc.tmc.edu, jttseng@mdanderson.org, meiliu@mdanderson.org, Helena Deus , rhysm.adams@gmail.com, wcarter@mdanderson.org, pablorf@lncc.br, Pablo Freire , Katherine Hale

Hi everybody, just in case any of you missed this and is heading for the integrative bioinformatics class today: the classes ended last week with the overview of multivariate exploratory and discriminant analysis methods. I also want to take this opportunity to thank you all so much for your participation in this class. A special thank you to Katherine for bringing her datasets and Biology problems to the table and sharing them with us. I look forward to keep interacting with you. Maybe we should set specialized workshops to frame specific collaborations. In any case, I'll use this first run of the Integrative Bioinformatics class to request it to be added as a subject of GSBS.

Cheers,
Jonas

Jonas S Almeida, Professor
Dept Bioinformatics and Computational Biology
Univ. Texas MDAnderson Cancer Center - unit 237
1515 Holcombe Blvd, Houston TX 77030-4009, USA
Email: jalmeida@mathbiol.org
Tel: 713 792 9875
fax: 305 574 5818
Vitae: http://jonasalmeida.info/
Lab: http://ibl.mdanderson.org/

Tuesday, March 20, 2007

To ontology or not to ontology?

That is the million dollar (billion :)) question. Read it and come to your own conclusions...
1) New technologies will make online search more intelligent--and may even lead to a "Web 3.0." part 1 and part 2
2) The Darker Side of the Semantic Web click here
3) Adaptation, or more likely contortion, of these technologies to be biologically relevant... coming soon...

Monday, March 19, 2007

Saturday, March 17, 2007

The Third Erich L. Lehmann Symposium takes place May 16 - 19, 2007, at Rice University. Deadline for early registration and submission of contributed talks is April 22.

List of Sessions:

1. Statistical problems in the analysis of genomic and magnetic resonance imaging data.
2. Modeling correlated biomedical data.
3. Multiplicity: Developments and current issues.
4. Multiple testing and subgroup analysis.
5. Probability, Levy Process, and Applications.
6. Regularized methods of classification and estimation of nonparametric regression and covariance matrices when data is high dimensional.
7. Statistical Inference for Population Substructures via Clustering, Mixture Models and other Approaches.
8. Statistical Optimality in Bioinformatics: Theory vs Practice.

Friday, March 16, 2007

PodCast on SPARQL and the Semantic Web

In this ITConversations podcast recorded 2006-07-17 [MP3], Elias Torres, a senior software engineer at IBM and a member of several W3C working groups, gives us an overview of the Semantic Web and how RDF and SPARQL are set to become the tools of choice when extracting data from the World Wide Web. In an interview, hosted by Phil Windley, Torres discusses what has happened to this technology in the past, where it is hopefully going in the near future, and what you can do today to take advantage of it.

Monday, March 12, 2007

Ontology Evolution: Not the Same as Schema Evolution

Knowledge and Information Systems, Volume 6, Number 4 / July, 2004

This paper addresses some important issues in data integration:
1) how the adoption of standards causes changes in the domain of discourse
2) how the application of the ontology to particular tasks, either data analysis or visualization of data sets, causes the emergence of concurring ontology views (conceptualization)
3) how the translation of an ontology from one knowledge representation language to another affects the specification of the ontology

As ontology development becomes a more ubiquitous and collaborative process, ontology versioning and evolution becomes an important area of ontology research. The many similarities between database-schema evolution and ontology evolution will allow us to build on the extensive research in schema evolution. However, there are also important differences between database schemas and ontologies. The differences stem from different usage paradigms, the presence of explicit semantics and different knowledge models. A lot of problems that existed only in theory in database research come to the forefront as practical problems in ontology evolution. These differences have important implications for the development of ontology-evolution frameworks: The traditional distinction between versioning and evolution is not applicable to ontologies. There are several dimensions along which compatibility between versions must be considered. The set of change operations for ontologies is different. We must develop automatic techniques for finding similarities and differences between versions.
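As a rough illustration of what "finding similarities and differences between versions" can mean at its simplest, the sketch below treats each ontology version as a set of (subject, relation, object) triples and compares them with set operations. This is an assumption made for illustration, not the technique proposed in the paper: a purely syntactic comparison like this one ignores the explicit semantics that the abstract says set ontologies apart from database schemas. The function name and the example triples are hypothetical.

```python
def diff_versions(old_triples, new_triples):
    """Compare two ontology versions given as iterables of triples."""
    old, new = set(old_triples), set(new_triples)
    return {
        "added": new - old,    # statements only in the new version
        "removed": old - new,  # statements only in the old version
        "kept": old & new,     # statements shared by both versions
    }


# Hypothetical versions of a tiny ontology.
v1 = {
    ("asthma", "is_a", "respiratory disease"),
    ("heart attack", "synonym_of", "myocardial infarction"),
}
v2 = {
    ("asthma", "is_a", "lung disease"),
    ("heart attack", "synonym_of", "myocardial infarction"),
}

changes = diff_versions(v1, v2)
for kind in ("added", "removed", "kept"):
    print(kind, sorted(changes[kind]))
```

Here a renamed parent term shows up as one removal plus one addition, which is exactly the kind of result a semantics-aware comparison would need to recognize as a single change.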

Wednesday, March 07, 2007

Clinical proteomics: A need to define the field and to begin to set adequate standards

[PROTEOMICS - Clinical Applications (2007) 1 (2) 148-156]

A team of 26 authors has attempted to suggest initial and as yet preliminary guidelines for clinical proteome analysis. As stated in the abstract "the aim of this manuscript is to initiate a constructive discussion about the definition of clinical proteomics, study requirements, pitfalls and (potential) use".

Note in particular Table 1 with good practice recommendations for and from experimentalists.

3) Projects

Integrated Data Management (S3DB) and Analysis (BiS)

The various projects at IBL rely on the articulation between two infrastructure resources. One is a semantic database where arbitrary data structures can be stored and managed. This work was described in two reports in Nature Biotech: the rationale at 2005 Sep; 23(9):1099-103, and the application at 2006 Sep; 24(9):1070-1071. The prototype application, S3DB, is now in use by several experimental groups. The second resource is a code distribution tool that synchronizes client machines with a data analysis environment maintained in a central repository of applications. This tool is designated as Bioinformatics Station (BiS) and, just like S3DB, its analytical modules are made publicly available as open source.

The various projects at IBL are therefore pursued as interoperable modules of a common infrastructure. Most of them rely on collaborations with extramural research groups for theory development, algorithm identification and, to a lesser degree, application deployment. This creates a de facto distributed research group for which IBL is a front end. For more information about the research work leading to the individual modules of BiS please see www.MathBiol.org.

Tuesday, March 06, 2007

Digital Future of the United States: Part I - The Future of the World Wide Web

Tim Berners-Lee congressional hearing March 2, 2007 on the topic "Digital Future of the United States: Part I - The Future of the World Wide Web".

For information on Tim Berners-Lee start with his entry on Wikipedia and then proceed to his own page at W3C. This hearing was hosted by the Subcommittee on Telecommunications and the Internet, Energy and Commerce Committee, of the US House of Representatives.

Note the mention of the Mayo Clinic medical record system at ~20 min, even before the hearing starts, and a few times later. The actual hearing only starts after a 30 min introduction by the congressional subcommittee. Please add your own notes to this entry when you hear it to help the rest of us go back to particular excerpts. Here is the direct link to the podcast.

Data integration at ~ 00:41:40
Life Sciences data integration through semantic web ~ 00:44:00 mins
Web Science at min 48, this topic is expanded in Science. 2006 Aug 11;313(5788):769-71.

--- End of TBL presentation at ~min 58 ---
--- Questioning started at 1:24:30 ---

> 1:24:30 - property rights

2) People

Jonas S Almeida, PhD, Professor,
Laboratory director,
updated list of publications and brief vitae at jonasalmeida.info.

Romesh Stanislaus, PhD, Instructor
Dept of Bioinformatics & Computational Biology
Member IBL

Yuliya Karpievitch, Graduate Student
Dept of Bioinformatics & Computational Biology http://ibl.mdanderson.org/~yuliya/resume_yk.pdf

Helena F Deus

Graduate Student
Dept Bioinformatics and Computational Biology @ MDAnderson Cancer Center
Biomathematics group @ ITQB/UNL - Portugal

Marco Vilela, Graduate Student
Dept of Bioinformatics & Computational Biology

Member IBL


Pablo Freire, Graduate Student
(Pablo, please fill in + we need your pic)


1) Mission Statement

The Integrative Bioinformatics Laboratory (IBL) conducts research on integrated management and analysis of biomolecular data.

IBL is a research unit of the Dept of Bioinformatics and Computational Biology, Division of Quantitative Sciences of The University of Texas MD Anderson Cancer Center, in Houston, Texas. IBL developed and maintains a computational framework for interoperable data management and analysis for predictive modeling in systems Biology. This integrative mission is pursued through theory development, algorithm identification and deployment of data management infrastructure. These prototypes are developed in response to the specific need for seamless systemic integration in biomedical research.

From Bytes to Bedside

From Bytes to Bedside - Data Integration and Computational Biology for Translational Cancer Research. [PLoS Comput Biol 3(2)]

Mathew JP, Taylor BS, Bader GD, Pyarajan S, Antoniotti M, Chinnaiyan AM, Sander C, Burakoff SJ, Mishra B.

Major advances in genome science and molecular technologies provide new opportunities at the interface between basic biological research and medical practice. The unprecedented completeness, accuracy, and volume of genomic and molecular data necessitate a new kind of computational biology for translational research. Key challenges are standardization of data capture and communication, organization of easily accessible repositories, and algorithms for integrated analysis based on heterogeneous sources of information. Also required are new ways of using complementary clinical and biological data, such as computational methods for predicting disease phenotype from molecular and genetic profiling. New combined experimental and computational methods hold the promise of more accurate diagnosis and prognosis as well as more effective prevention and therapy.