The UCSC Code Base: general principles and examples

Slides: 22
Provided by: jimk88

1
The UCSC Code Base: general principles and examples
2
Lagging Edge Software
  • C language - compilers still available!
  • CGI Scripts - portable if not pretty.
  • SQL database - at least MySQL is free.

3
Problems with C
  • Missing booleans and strings.
  • No real objects.
  • Must free things

4
Coping with Missing Data Types in C
  • #define boolean int
  • Fixing lack of real string type much harder
  • lineFile/common modules and autoSql code
    generator make parsing files relatively painless
  • dyString module not a horrible string class
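
To make the string problem concrete, here is a minimal sketch of a growable string buffer in the spirit of what the slide describes. The names strBuf, strBufNew, strBufAppend and strBufFree are hypothetical and chosen for this illustration only; they are not the dyString API.

```c
#include <stdlib.h>
#include <string.h>

#define boolean int   /* The slide's stand-in for a missing boolean type. */

struct strBuf
/* Hypothetical growable string. Not the actual dyString module. */
    {
    char *string;     /* Null terminated contents. */
    size_t size;      /* Length of contents, not counting the null. */
    size_t capacity;  /* Bytes allocated for string. */
    };

struct strBuf *strBufNew(size_t initialCapacity)
/* Allocate an empty buffer. Returns NULL if out of memory. */
{
struct strBuf *sb = calloc(1, sizeof(*sb));
if (sb == NULL)
    return NULL;
if (initialCapacity == 0)
    initialCapacity = 16;
sb->string = malloc(initialCapacity);
if (sb->string == NULL)
    {
    free(sb);
    return NULL;
    }
sb->string[0] = 0;
sb->capacity = initialCapacity;
return sb;
}

boolean strBufAppend(struct strBuf *sb, const char *text)
/* Append text, doubling capacity as needed. Returns 0 if out of memory. */
{
size_t len = strlen(text);
size_t needed = sb->size + len + 1;
if (needed > sb->capacity)
    {
    size_t newCapacity = sb->capacity;
    char *newString;
    while (newCapacity < needed)
        newCapacity *= 2;
    newString = realloc(sb->string, newCapacity);
    if (newString == NULL)
        return 0;
    sb->string = newString;
    sb->capacity = newCapacity;
    }
memcpy(sb->string + sb->size, text, len + 1);   /* Copies the null too. */
sb->size += len;
return 1;
}

void strBufFree(struct strBuf **pSb)
/* Free buffer and set pointer to NULL. */
{
struct strBuf *sb = *pSb;
if (sb != NULL)
    {
    free(sb->string);
    free(sb);
    *pSb = NULL;
    }
}
```

Appending "chr" and then "22:14600000" to a buffer created with strBufNew(4) forces one reallocation and leaves sb->string holding "chr22:14600000".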

5
Object Oriented Programming in C
  • Build objects around structures.
  • Make families of functions with names that start
    with the structure name, and that take the
    structure as the first argument.
  • Implement polymorphism/virtual functions with
    function pointers in structure.
  • Inheritance is still difficult. Perhaps this is
    not such a bad thing.

6
    struct dnaSeq
    /* A dna sequence in one-letter-per-base format. */
        {
        struct dnaSeq *next;  /* Next in list. */
        char *name;           /* Sequence name. */
        char *dna;            /* a's c's g's and t's.  Null terminated */
        int size;             /* Number of bases. */
        };

    struct dnaSeq *dnaSeqFromString(char *string);
    /* Convert string containing sequence and possibly
     * white space and numbers to a dnaSeq. */

    void dnaSeqFree(struct dnaSeq **pSeq);
    /* Free dnaSeq and set pointer to NULL. */

    void dnaSeqFreeList(struct dnaSeq **pList);
    /* Free list of dnaSeqs. */
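
The slide gives only the declarations. Below is one possible set of function bodies, written to match the comments on the slide. The bodies are a guess for illustration, not the library's actual code: in particular, lowercasing the bases and leaving name as NULL are assumptions of this sketch.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's.  Null terminated */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string)
/* Keep letters (lowercased), skip white space and numbers.
 * Assumption of this sketch: name is left NULL. */
{
struct dnaSeq *seq = calloc(1, sizeof(*seq));
char *s, *d;
if (seq == NULL)
    return NULL;
seq->dna = malloc(strlen(string) + 1);
if (seq->dna == NULL)
    {
    free(seq);
    return NULL;
    }
d = seq->dna;
for (s = string; *s != 0; ++s)
    if (isalpha((unsigned char)*s))
        *d++ = (char)tolower((unsigned char)*s);
*d = 0;
seq->size = (int)(d - seq->dna);
return seq;
}

void dnaSeqFree(struct dnaSeq **pSeq)
/* Free dnaSeq and set pointer to NULL. */
{
struct dnaSeq *seq = *pSeq;
if (seq != NULL)
    {
    free(seq->name);
    free(seq->dna);
    free(seq);
    *pSeq = NULL;
    }
}

void dnaSeqFreeList(struct dnaSeq **pList)
/* Free list of dnaSeqs. */
{
struct dnaSeq *seq, *next;
for (seq = *pList; seq != NULL; seq = next)
    {
    next = seq->next;     /* Save next before seq is freed. */
    dnaSeqFree(&seq);
    }
*pList = NULL;
}
```

Taking a pointer to the caller's pointer is what lets the free routines reset it to NULL, so a stale pointer cannot be used by accident afterwards.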

7
    struct screenObj
    /* A two dimensional object in a sleazy video game. */
        {
        struct screenObj *next;   /* Next in list. */
        char *name;               /* Object name. */
        int x,y,width,height;     /* Bounds of object. */
        void (*draw)(struct screenObj *obj);   /* Draw object */
        boolean (*in)(struct screenObj *obj, int x, int y);
                                  /* Return true if x,y is in object */
        void *custom;             /* Custom data for a particular type */
        void (*freeCustom)(struct screenObj *obj);
                                  /* Free custom data. */
        };

    #define screenObjDraw(obj) (obj->draw(obj))
    /* Draw object. */

    void screenObjFree(struct screenObj **pObj);
    /* Free up screen object including custom part. */
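
As an illustration of how the function pointers give polymorphism, the sketch below fills in the screenObj structure for two made-up object types, a box and a circle. The constructors boxNew and circleNew, the countingDraw routine and the drawCount counter are hypothetical names invented for this example; only the structure layout and screenObjFree's signature come from the slide.

```c
#include <stdlib.h>

#define boolean int

struct screenObj
    {
    struct screenObj *next;   /* Next in list. */
    char *name;               /* Object name. */
    int x,y,width,height;     /* Bounds of object. */
    void (*draw)(struct screenObj *obj);
    boolean (*in)(struct screenObj *obj, int x, int y);
    void *custom;             /* Custom data for a particular type */
    void (*freeCustom)(struct screenObj *obj);
    };

#define screenObjDraw(obj) ((obj)->draw(obj))

static int drawCount = 0;  /* Counts draws so the sketch needs no display. */

static void countingDraw(struct screenObj *obj)
{
(void)obj;
++drawCount;
}

static boolean boxIn(struct screenObj *obj, int x, int y)
/* True if x,y is inside the bounding box. */
{
return x >= obj->x && x < obj->x + obj->width
    && y >= obj->y && y < obj->y + obj->height;
}

static boolean circleIn(struct screenObj *obj, int x, int y)
/* True if x,y is inside the circle. Radius is kept in custom data. */
{
int r = *(int *)obj->custom;
int dx = x - (obj->x + r), dy = y - (obj->y + r);
return dx*dx + dy*dy <= r*r;
}

static void circleFreeCustom(struct screenObj *obj)
{
free(obj->custom);
obj->custom = NULL;
}

struct screenObj *boxNew(int x, int y, int width, int height)
{
struct screenObj *obj = calloc(1, sizeof(*obj));
if (obj == NULL)
    return NULL;
obj->x = x; obj->y = y; obj->width = width; obj->height = height;
obj->draw = countingDraw;
obj->in = boxIn;
return obj;
}

struct screenObj *circleNew(int x, int y, int radius)
{
struct screenObj *obj = calloc(1, sizeof(*obj));
int *pRadius = malloc(sizeof(*pRadius));
if (obj == NULL || pRadius == NULL)
    {
    free(obj);
    free(pRadius);
    return NULL;
    }
*pRadius = radius;
obj->x = x; obj->y = y; obj->width = obj->height = 2*radius;
obj->draw = countingDraw;
obj->in = circleIn;          /* Same slot as boxIn, different behavior. */
obj->custom = pRadius;
obj->freeCustom = circleFreeCustom;
return obj;
}

void screenObjFree(struct screenObj **pObj)
/* Free up screen object including custom part. */
{
struct screenObj *obj = *pObj;
if (obj != NULL)
    {
    if (obj->freeCustom != NULL)
        obj->freeCustom(obj);
    free(obj->name);
    free(obj);
    *pObj = NULL;
    }
}
```

Calling obj->in(obj, 0, 0) on a box at the origin and on a circle with the same bounding box gives different answers, because each object carries its own routine in the in slot.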

8
Freeing Dynamic Memory
  • Make sure all pointers start out NULL.
  • Routines go through needMem/freeMem not
    malloc/free.
  • Pointers set to NULL when freed:
    freeMem(buf) // sets buf to NULL.
  • AutoSql generates objFree and objFreeList.
  • Local Mem module sometimes useful for bulk
    alloc/free. Used by aligners.
  • In CGIs and many command line utilities, you
    don't need to free everything. Leaking memory
    until program exit is fine.
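
The bulk alloc/free idea behind a local memory pool can be sketched as follows. This is an illustration under assumptions, not the library's local memory module: memPool, memPoolNew, memPoolAlloc and memPoolFree are hypothetical names, and the alignment rule is a simplification that should be adequate for ints, doubles and pointers on common platforms.

```c
#include <stdlib.h>

struct poolBlock
/* One chunk of memory that allocations are carved out of. */
    {
    struct poolBlock *next;
    char *data;
    size_t used;
    size_t size;
    };

struct memPool
/* Hypothetical pool: allocate many small things, free them all at once. */
    {
    struct poolBlock *blocks;   /* Most recent block first. */
    size_t blockSize;           /* Minimum size of a new block. */
    };

union poolAlign {long long l; double d; void *p;};

struct memPool *memPoolNew(size_t blockSize)
{
struct memPool *pool = calloc(1, sizeof(*pool));
if (pool != NULL)
    pool->blockSize = (blockSize > 0 ? blockSize : 4096);
return pool;
}

void *memPoolAlloc(struct memPool *pool, size_t size)
/* Return zeroed memory from pool, or NULL if out of memory. */
{
size_t align = sizeof(union poolAlign);
struct poolBlock *block = pool->blocks;
void *pt;
size = (size + align - 1) / align * align;   /* Round up for alignment. */
if (block == NULL || block->size - block->used < size)
    {
    size_t newSize = (size > pool->blockSize ? size : pool->blockSize);
    block = calloc(1, sizeof(*block));
    if (block == NULL)
        return NULL;
    block->data = calloc(1, newSize);
    if (block->data == NULL)
        {
        free(block);
        return NULL;
        }
    block->size = newSize;
    block->next = pool->blocks;
    pool->blocks = block;
    }
pt = block->data + block->used;
block->used += size;
return pt;
}

void memPoolFree(struct memPool **pPool)
/* Free every block in one pass and set pointer to NULL. */
{
struct memPool *pool = *pPool;
if (pool != NULL)
    {
    struct poolBlock *block, *next;
    for (block = pool->blocks; block != NULL; block = next)
        {
        next = block->next;
        free(block->data);
        free(block);
        }
    free(pool);
    *pPool = NULL;
    }
}
```

The trade-off is that individual items cannot be freed, which suits code such as an aligner that builds many small structures and discards them together.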

9
Name nitpicking
  • var_name, varname, VARNAME, VarName, varName?
  • We use varName.
  • Abbreviations are typically 1st three letters of
    word, or 1st letter of each word.
  • variable -> var
  • local memory pool -> lmp
  • Abbreviations and numbers are considered a word
    for capitalization purposes
  • bigDnaBuffer
  • read454Data
  • Consistent variable naming conventions make it
    easier to remember what you called something.

10
Relational Databases
  • Relational databases consist of tables, indices,
    and the Structured Query Language (SQL).
  • Tables are much like tab-separated files
    chrom  start     end       name   strand  score
    chr22  14600000  14612345  ldlr   +       0.989
    chr21  18283999  18298577  vldlr  -       0.998
  • Fields are simple - no lists or substructures.
  • Can join tables based on a shared field. This is
    flexible, but only as fast as the index.
  • Tables and joins are accessed a row at a time.
  • The row is represented as an array of strings.

11
Converting A Row to Object
struct exoFish *exoFishLoad(char **row)
/* Load a exoFish from row fetched with select * from exoFish
 * from database.  Dispose of this with exoFishFree(). */
{
struct exoFish *ret;

AllocVar(ret);
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}
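
To try the loader outside the code base, the sketch below supplies minimal stand-ins for AllocVar, cloneString and sqlUnsigned. These stand-ins are assumptions written for this example: they do no error reporting, and the library's own versions may behave differently (for instance on malformed numbers). The exoFishFree body is likewise a guess based on the "set pointer to NULL" convention.

```c
#include <stdlib.h>
#include <string.h>

struct exoFish
    {
    struct exoFish *next;   /* Next in list. */
    char *chrom;
    unsigned chromStart;
    unsigned chromEnd;
    char *name;
    unsigned score;
    };

/* Stand-ins for library helpers. Assumptions of this sketch only. */
#define AllocVar(pt) ((pt) = calloc(1, sizeof(*(pt))))

static char *cloneString(const char *s)
/* Return a heap copy of s, or NULL if out of memory. */
{
char *copy = malloc(strlen(s) + 1);
if (copy != NULL)
    strcpy(copy, s);
return copy;
}

static unsigned sqlUnsigned(const char *s)
/* Convert decimal string to unsigned. No validation in this sketch. */
{
return (unsigned)strtoul(s, NULL, 10);
}

struct exoFish *exoFishLoad(char **row)
/* Load an exoFish from a row of strings in column order. */
{
struct exoFish *ret;

AllocVar(ret);
if (ret == NULL)
    return NULL;
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}

void exoFishFree(struct exoFish **pEl)
/* Free exoFish and set pointer to NULL. */
{
struct exoFish *el = *pEl;
if (el != NULL)
    {
    free(el->chrom);
    free(el->name);
    free(el);
    *pEl = NULL;
    }
}
```

A row here is just an array of five strings, so the loader can be exercised with a literal array before any database is involved.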
12
Motivation for AutoSql
  • Row to object code is tedious at best.
  • Also have save object, free object code to write.
  • SQL create statement needs to match C structure.
  • Lack of lists without doing a join can seriously
    impact performance and complicate schema.

13
AutoSql Data Declaration
table exoFish
"An evolutionarily conserved region (ecore) with Tetroadon"
    (
    string chrom;      "Human chromosome or FPC contig"
    uint chromStart;   "Start position in chromosome"
    uint chromEnd;     "End position in chromosome"
    string name;       "Ecore name in Genoscope database"
    uint score;        "Score from 0 to 1000"
    )
See autoSql.doc for more details.
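
For orientation, here is a hand-written C structure that mirrors the declaration field by field, together with a small save routine that writes one object as a tab-separated line. This is an approximation for illustration and not autoSql's generated output; exoFishTabLine is a hypothetical name.

```c
#include <stdio.h>

struct exoFish
/* An evolutionarily conserved region (ecore) with Tetroadon */
    {
    struct exoFish *next;   /* Next in singly linked list. */
    char *chrom;            /* Human chromosome or FPC contig */
    unsigned chromStart;    /* Start position in chromosome */
    unsigned chromEnd;      /* End position in chromosome */
    char *name;             /* Ecore name in Genoscope database */
    unsigned score;         /* Score from 0 to 1000 */
    };

int exoFishTabLine(const struct exoFish *el, char *buf, size_t bufSize)
/* Write el to buf as one tab-separated line, fields in declaration
 * order. Returns what snprintf returns: the length the full line needs,
 * not counting the terminating null. */
{
return snprintf(buf, bufSize, "%s\t%u\t%u\t%s\t%u\n",
    el->chrom, el->chromStart, el->chromEnd, el->name, el->score);
}
```

Keeping the field order identical in the declaration, the structure and the tab-separated line is what lets one description drive the SQL table, the loader and the saver.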
14
Checking out the Code
  • Use CVS to check out kent/src. CVS helps many
    people work on the same code at once.
  • Log onto hgwdev
  • cvs checkout -d kent
  • mkdir ~/bin/x86_64
  • cp /usr/local/apache/cgi-bin/hg.conf ~/.hg.conf
  • The source will be in kent/src under your home
    dir
  • The binaries will end up in ~/bin/x86_64 mostly.
  • You'll have read access to database (.hg.conf)

15
Organization of the code
  • General purpose library modules
  • src/lib/*.c src/inc/*.h
  • Human genome project library modules
  • src/hg/lib/*.c,*.sql,*.as src/hg/inc/*.h
  • Alignment tools
  • Pairwise genome alignment src/hg/mouseStuff
  • Multiple genome alignment src/hg/ratStuff
  • cDNA alignment src/hg/psl, src/blat
  • Database loaders src/hg/makeDb
  • Database build docs src/hg/makeDb/doc

16
Project Work Flow
  • Create subdirectory of src/hg for your project.
  • Use newProg to create shell of a new command line
    utility.
  • Check new programs or modifications of old
    programs into CVS when they seem to work.
  • Participate in weekly pair reviews of checked in
    code.
  • Use pushQ to notify QA group of new code or data
    destined for public web site.
  • Don't be shy asking for advice or help from
    browser developers, especially from Jim, Hiram,
    and Kate, the senior browser developers.

17
Gene Code Demo
  • Finding longest gene in hg18 (latest human)
  • In SQL
  • In C
  • Finding longest intron in C
  • From database
  • From flat file
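
The core of the longest-intron computation can be sketched independently of where the gene comes from. The sketch assumes a gene is given as parallel arrays of exon starts and ends, sorted by position with half-open coordinates; the function name longestIntron is hypothetical and the demo code itself is not shown in the slides.

```c
unsigned longestIntron(const unsigned *exonStarts, const unsigned *exonEnds,
    int exonCount)
/* Return size of the longest intron, or 0 if there are fewer than two
 * exons. An intron is the gap between the end of one exon and the
 * start of the next. */
{
unsigned longest = 0;
int i;
for (i = 1; i < exonCount; ++i)
    {
    unsigned size = exonStarts[i] - exonEnds[i-1];
    if (size > longest)
        longest = size;
    }
return longest;
}
```

Whether the rows come from a database query or a flat file, only the code that fills the two arrays changes.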

18
List and hash demo
  • Find out how many SNPs are both in Affy250k and
    Illumina 300k arrays.
  • snpArrayAffy250Nsp.rsId
  • snpArrayIllumina300.name
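
One way to count the shared SNPs is to put the IDs from one array into a hash and then look up each ID from the other. The sketch below uses a small chained hash written for this example; nameHash and its functions are hypothetical names, not the library's hash module, and the sketch assumes each list contains an ID at most once.

```c
#include <stdlib.h>
#include <string.h>

struct hashEl
    {
    struct hashEl *next;    /* Next in bucket. */
    char *name;
    };

struct nameHash
    {
    struct hashEl **table;  /* Array of buckets. */
    size_t size;            /* Number of buckets. */
    };

static size_t hashString(const char *s, size_t size)
{
size_t h = 5381;
while (*s != 0)
    h = h * 33 + (unsigned char)*s++;
return h % size;
}

struct nameHash *nameHashNew(size_t size)
{
struct nameHash *hash = calloc(1, sizeof(*hash));
if (hash == NULL)
    return NULL;
hash->table = calloc(size, sizeof(hash->table[0]));
if (hash->table == NULL)
    {
    free(hash);
    return NULL;
    }
hash->size = size;
return hash;
}

int nameHashContains(struct nameHash *hash, const char *name)
{
struct hashEl *el = hash->table[hashString(name, hash->size)];
for (; el != NULL; el = el->next)
    if (strcmp(el->name, name) == 0)
        return 1;
return 0;
}

int nameHashAdd(struct nameHash *hash, const char *name)
/* Add name unless already present. Returns 0 if out of memory. */
{
size_t slot = hashString(name, hash->size);
struct hashEl *el;
if (nameHashContains(hash, name))
    return 1;
el = malloc(sizeof(*el));
if (el == NULL)
    return 0;
el->name = malloc(strlen(name) + 1);
if (el->name == NULL)
    {
    free(el);
    return 0;
    }
strcpy(el->name, name);
el->next = hash->table[slot];
hash->table[slot] = el;
return 1;
}

void nameHashFree(struct nameHash **pHash)
{
struct nameHash *hash = *pHash;
size_t i;
if (hash == NULL)
    return;
for (i = 0; i < hash->size; ++i)
    {
    struct hashEl *el, *next;
    for (el = hash->table[i]; el != NULL; el = next)
        {
        next = el->next;
        free(el->name);
        free(el);
        }
    }
free(hash->table);
free(hash);
*pHash = NULL;
}

int countShared(const char **idsA, int countA, const char **idsB, int countB)
/* Count ids in B that also occur in A. Returns -1 if out of memory. */
{
struct nameHash *hash = nameHashNew(1024);
int i, shared = 0;
if (hash == NULL)
    return -1;
for (i = 0; i < countA; ++i)
    if (!nameHashAdd(hash, idsA[i]))
        {
        nameHashFree(&hash);
        return -1;
        }
for (i = 0; i < countB; ++i)
    if (nameHashContains(hash, idsB[i]))
        ++shared;
nameHashFree(&hash);
return shared;
}
```

In the demo's terms, idsA would hold the rsId values from snpArrayAffy250Nsp and idsB the name values from snpArrayIllumina300.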

19
Working with range tree and hChromQuery
  • Range tree - efficiently keeps a set of
    non-intersecting features that can be accessed
    by range.
  • Illumina 300k SNPs that intersect most
    repeatMasker elements.
  • snpArrayIllumina300
  • chr1_rmsk, chr2_rmsk, etc.
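
The query a range tree answers can be illustrated with a simpler stand-in: a sorted array of non-intersecting ranges searched by binary search. This is not the range tree module from the talk; struct range, rangesSort and rangesOverlap are hypothetical names, and coordinates are assumed to be half-open.

```c
#include <stdlib.h>

struct range
    {
    int start, end;     /* Half-open interval [start,end). */
    };

static int rangeCmp(const void *a, const void *b)
/* Compare by start position. */
{
const struct range *ra = a, *rb = b;
return (ra->start > rb->start) - (ra->start < rb->start);
}

void rangesSort(struct range *ranges, int count)
{
qsort(ranges, count, sizeof(ranges[0]), rangeCmp);
}

int rangesOverlap(const struct range *ranges, int count, int start, int end)
/* Return 1 if [start,end) overlaps any range. The array must be sorted
 * and its ranges must not intersect each other. */
{
int lo = 0, hi = count;
/* Find the first range that ends after start. Because the ranges are
 * sorted and disjoint, their ends are in increasing order too. */
while (lo < hi)
    {
    int mid = lo + (hi - lo)/2;
    if (ranges[mid].end <= start)
        lo = mid + 1;
    else
        hi = mid;
    }
return lo < count && ranges[lo].start < end;
}
```

A sorted array works when all features are loaded before any query; a tree is the better fit when ranges are added while queries are being made.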

20
Working with binKeeper
  • binKeeper - keeps a collection of possibly
    intersecting features for range queries.
  • chromKeeper - a binKeeper for each chromosome.
  • Illumina 300k SNPs that intersect with known gene
    exons.
  • snpArrayIllumina300
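
The idea of binning features for range queries can be sketched with a deliberately simplified index. Features that fit inside one fixed-size bin go in that bin; features that cross a bin boundary go on a single overflow list that every query checks. This is an illustration only, not the binKeeper implementation: binIndex, its functions and the bin size are all assumptions of the sketch.

```c
#include <stdlib.h>

#define BIN_SHIFT 17   /* 2^17 bases per bin. Arbitrary for this sketch. */

struct feature
    {
    struct feature *next;
    int start, end;     /* Half-open interval [start,end). */
    void *val;          /* Caller's data. */
    };

struct binIndex
    {
    int binCount;
    struct feature **bins;      /* Features that fit in one bin. */
    struct feature *spanning;   /* Features that cross a bin boundary. */
    };

struct binIndex *binIndexNew(int maxPos)
{
struct binIndex *bi = calloc(1, sizeof(*bi));
if (bi == NULL)
    return NULL;
bi->binCount = (maxPos >> BIN_SHIFT) + 1;
bi->bins = calloc(bi->binCount, sizeof(bi->bins[0]));
if (bi->bins == NULL)
    {
    free(bi);
    return NULL;
    }
return bi;
}

int binIndexAdd(struct binIndex *bi, int start, int end, void *val)
/* Add feature. Returns 0 if out of range or out of memory. */
{
int startBin = start >> BIN_SHIFT, endBin = (end - 1) >> BIN_SHIFT;
struct feature *f, **pList;
if (start < 0 || end <= start || endBin >= bi->binCount)
    return 0;
f = calloc(1, sizeof(*f));
if (f == NULL)
    return 0;
f->start = start; f->end = end; f->val = val;
pList = (startBin == endBin ? &bi->bins[startBin] : &bi->spanning);
f->next = *pList;
*pList = f;
return 1;
}

static int countInList(struct feature *f, int start, int end)
{
int count = 0;
for (; f != NULL; f = f->next)
    if (f->start < end && f->end > start)
        ++count;
return count;
}

int binIndexCountOverlaps(struct binIndex *bi, int start, int end)
/* Count features overlapping [start,end), with 0 <= start < end. */
{
int startBin = start >> BIN_SHIFT, endBin = (end - 1) >> BIN_SHIFT, bin;
int count = countInList(bi->spanning, start, end);
if (endBin >= bi->binCount)
    endBin = bi->binCount - 1;
for (bin = startBin; bin <= endBin; ++bin)
    count += countInList(bi->bins[bin], start, end);
return count;
}

static void featureFreeList(struct feature *f)
{
struct feature *next;
for (; f != NULL; f = next)
    {
    next = f->next;
    free(f);
    }
}

void binIndexFree(struct binIndex **pBi)
{
struct binIndex *bi = *pBi;
int i;
if (bi == NULL)
    return;
for (i = 0; i < bi->binCount; ++i)
    featureFreeList(bi->bins[i]);
featureFreeList(bi->spanning);
free(bi->bins);
free(bi);
*pBi = NULL;
}
```

Unlike the sorted-array approach, nothing here requires features to be disjoint, which is why binning suits overlapping features such as exons from different transcripts. One index per chromosome gives the chromKeeper arrangement the slide mentions.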

21
Conclusion
  • It's always safer on the lagging edge
  • Consider redesigning system as COBOL
    character-based application