Title: The UCSC Code Base: general principles and examples
1 The UCSC Code Base: general principles and examples
2 Lagging Edge Software
- C language - compilers still available!
- CGI Scripts - portable if not pretty.
- SQL database - at least MySQL is free.
3 Problems with C
- Missing booleans and strings.
- No real objects.
- Must free things.
4 Coping with Missing Data Types in C
- #define boolean int
- Fixing lack of real string type much harder
- lineFile/common modules and autoSql code generator make parsing files relatively painless
- dyString module not a horrible string class
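To make the string point concrete, here is a minimal sketch of a growable string buffer in the spirit of a string class for C. It uses only the standard library. The names growString, growStringNew, growStringAppend, and growStringFree are hypothetical and chosen for illustration; this is not the dyString API.

```c
#include <stdlib.h>
#include <string.h>

#define boolean int   /* The missing boolean type, as on the slide. */
#define TRUE 1
#define FALSE 0

struct growString
/* A string that grows as text is appended (hypothetical, not dyString). */
    {
    char *string;      /* Current contents, always null terminated. */
    size_t size;       /* Length of contents, not counting the terminator. */
    size_t capacity;   /* Bytes allocated for string. */
    };

struct growString *growStringNew(void)
/* Allocate an empty growable string.  Returns NULL if out of memory. */
{
struct growString *gs = calloc(1, sizeof(*gs));
if (gs == NULL)
    return NULL;
gs->capacity = 16;
gs->string = calloc(gs->capacity, 1);
if (gs->string == NULL)
    {
    free(gs);
    return NULL;
    }
return gs;
}

boolean growStringAppend(struct growString *gs, const char *text)
/* Append text, doubling the buffer as needed.  Returns FALSE if out of memory. */
{
size_t textSize = strlen(text);
size_t needed = gs->size + textSize + 1;
if (needed > gs->capacity)
    {
    size_t newCapacity = gs->capacity;
    char *newString;
    while (newCapacity < needed)
        newCapacity *= 2;
    newString = realloc(gs->string, newCapacity);
    if (newString == NULL)
        return FALSE;
    gs->string = newString;
    gs->capacity = newCapacity;
    }
memcpy(gs->string + gs->size, text, textSize + 1);   /* Copies the terminator too. */
gs->size += textSize;
return TRUE;
}

void growStringFree(struct growString **pGs)
/* Free growable string and set pointer to NULL. */
{
struct growString *gs = *pGs;
if (gs != NULL)
    {
    free(gs->string);
    free(gs);
    *pGs = NULL;
    }
}
```

The caller appends pieces without tracking buffer sizes and reads the result from the string field. Doubling the capacity keeps the number of reallocations small as the string grows.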
5 Object Oriented Programming in C
- Build objects around structures.
- Make families of functions with names that start with the structure name, and that take the structure as the first argument.
- Implement polymorphism/virtual functions with function pointers in structure.
- Inheritance is still difficult. Perhaps this is not such a bad thing.
6
    struct dnaSeq
    /* A dna sequence in one-letter-per-base format. */
        {
        struct dnaSeq *next;  /* Next in list. */
        char *name;           /* Sequence name. */
        char *dna;            /* a's c's g's and t's.  Null terminated */
        int size;             /* Number of bases. */
        };

    struct dnaSeq *dnaSeqFromString(char *string);
    /* Convert string containing sequence and possibly
     * white space and numbers to a dnaSeq. */

    void dnaSeqFree(struct dnaSeq **pSeq);
    /* Free dnaSeq and set pointer to NULL. */

    void dnaSeqFreeList(struct dnaSeq **pList);
    /* Free list of dnaSeqs. */
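The declarations above show only the interface. As a rough sketch of how such a family of functions could be written with the standard library alone, the bodies below are illustrative assumptions, not the library's implementation. In particular, the rule "keep letters, lowercase them, drop everything else" is a guess at what "possibly white space and numbers" implies.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's.  Null terminated */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string)
/* Convert string containing sequence and possibly white space and
 * numbers to a dnaSeq.  Assumed behavior: letters are kept and
 * lowercased, all other characters are skipped. */
{
struct dnaSeq *seq = calloc(1, sizeof(*seq));   /* All fields start NULL or 0. */
char *s;
if (seq == NULL)
    return NULL;
seq->dna = malloc(strlen(string) + 1);
if (seq->dna == NULL)
    {
    free(seq);
    return NULL;
    }
for (s = string; *s != 0; ++s)
    {
    if (isalpha((unsigned char)*s))
        seq->dna[seq->size++] = (char)tolower((unsigned char)*s);
    }
seq->dna[seq->size] = 0;
return seq;
}

void dnaSeqFree(struct dnaSeq **pSeq)
/* Free dnaSeq and set pointer to NULL. */
{
struct dnaSeq *seq = *pSeq;
if (seq != NULL)
    {
    free(seq->name);
    free(seq->dna);
    free(seq);
    *pSeq = NULL;
    }
}

void dnaSeqFreeList(struct dnaSeq **pList)
/* Free list of dnaSeqs. */
{
struct dnaSeq *seq, *next;
for (seq = *pList; seq != NULL; seq = next)
    {
    next = seq->next;
    dnaSeqFree(&seq);
    }
*pList = NULL;
}
```

Every function name starts with dnaSeq, and the free routines take the address of the caller's pointer so they can set it to NULL.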
7
    struct screenObj
    /* A two dimensional object in a sleazy video game. */
        {
        struct screenObj *next;   /* Next in list. */
        char *name;               /* Object name. */
        int x,y,width,height;     /* Bounds of object. */
        void (*draw)(struct screenObj *obj);   /* Draw object */
        boolean (*in)(struct screenObj *obj, int x, int y);
                                  /* Return true if x,y is in object */
        void *custom;             /* Custom data for a particular type */
        void (*freeCustom)(struct screenObj *obj);
                                  /* Free custom data. */
        };

    #define screenObjDraw(obj) (obj->draw(obj))
    /* Draw object. */

    void screenObjFree(struct screenObj **pObj);
    /* Free up screen object including custom part. */
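A sketch of how two object types might share this structure follows. The box and circle types, their constructors boxNew and circleNew, and the printf-based drawing are all hypothetical and invented for illustration. Only the structure layout, the screenObjDraw macro, and the screenObjFree signature come from the slide.

```c
#include <stdio.h>
#include <stdlib.h>

#define boolean int

struct screenObj
    {
    struct screenObj *next;
    char *name;
    int x,y,width,height;
    void (*draw)(struct screenObj *obj);
    boolean (*in)(struct screenObj *obj, int x, int y);
    void *custom;
    void (*freeCustom)(struct screenObj *obj);
    };

#define screenObjDraw(obj) ((obj)->draw(obj))

static void boxDraw(struct screenObj *obj)
/* Box draw method.  Printing stands in for real drawing. */
{
printf("box %s: %dx%d\n", obj->name, obj->width, obj->height);
}

static boolean boxIn(struct screenObj *obj, int x, int y)
/* Box hit test: true if x,y is inside the bounds. */
{
return x >= obj->x && x < obj->x + obj->width
    && y >= obj->y && y < obj->y + obj->height;
}

static void circleDraw(struct screenObj *obj)
/* Circle draw method.  The radius lives in the custom data. */
{
printf("circle %s: radius %d\n", obj->name, *(int *)obj->custom);
}

static boolean circleIn(struct screenObj *obj, int x, int y)
/* Circle hit test: true if x,y is within radius of the center. */
{
int radius = *(int *)obj->custom;
int dx = x - (obj->x + radius), dy = y - (obj->y + radius);
return dx*dx + dy*dy <= radius*radius;
}

static void circleFreeCustom(struct screenObj *obj)
/* Free the radius stored in custom. */
{
free(obj->custom);
}

struct screenObj *boxNew(char *name, int x, int y, int width, int height)
/* Make a box.  The name is not copied; the caller keeps ownership. */
{
struct screenObj *obj = calloc(1, sizeof(*obj));
if (obj == NULL)
    return NULL;
obj->name = name;
obj->x = x;
obj->y = y;
obj->width = width;
obj->height = height;
obj->draw = boxDraw;
obj->in = boxIn;
return obj;
}

struct screenObj *circleNew(char *name, int centerX, int centerY, int radius)
/* Make a circle: box-shaped bounds plus custom data and circle methods. */
{
struct screenObj *obj = boxNew(name, centerX - radius, centerY - radius,
    2*radius, 2*radius);
int *pRadius = malloc(sizeof(*pRadius));
if (obj == NULL || pRadius == NULL)
    {
    free(obj);
    free(pRadius);
    return NULL;
    }
*pRadius = radius;
obj->custom = pRadius;
obj->draw = circleDraw;
obj->in = circleIn;
obj->freeCustom = circleFreeCustom;
return obj;
}

void screenObjFree(struct screenObj **pObj)
/* Free up screen object including custom part. */
{
struct screenObj *obj = *pObj;
if (obj != NULL)
    {
    if (obj->freeCustom != NULL)
        obj->freeCustom(obj);
    free(obj);
    *pObj = NULL;
    }
}
```

Code that walks a list of screen objects can call screenObjDraw(obj) or obj->in(obj, x, y) without knowing which type it holds. The circle reuses the box constructor and overrides its methods, which is about as close to inheritance as this style gets.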
8 Freeing Dynamic Memory
- Make sure all pointers start out NULL.
- Routines go through needMem/freeMem not malloc/free.
- Pointers set to NULL when freed: freeMem(buf) // sets buf to NULL.
- AutoSql generates objFree, objFreeList.
- Local Mem module sometimes useful for bulk alloc/free. Used by aligners.
- In CGIs and many command line utilities, you don't need to free everything. Leaking memory until program exit is fine.
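The discipline above can be sketched with two small wrappers over the standard allocator. The names allocZeroed and freeAndNull are hypothetical stand-ins written for this example; they are not the library's needMem and freeMem, whose exact behavior is not shown on the slide.

```c
#include <stdio.h>
#include <stdlib.h>

void *allocZeroed(size_t size)
/* Hypothetical allocator wrapper: returns zeroed memory, so pointers
 * inside a new structure start out NULL on typical platforms.  Aborts
 * on failure so callers never need to check the result. */
{
void *pt = calloc(1, size);
if (pt == NULL)
    {
    fprintf(stderr, "Out of memory allocating %lu bytes\n", (unsigned long)size);
    exit(EXIT_FAILURE);
    }
return pt;
}

/* Hypothetical free wrapper: frees pt and sets it to NULL.  Written as a
 * macro so it works for any pointer type.  Note that pt is evaluated
 * twice, so it should be a plain variable, not an expression with side
 * effects.  Calling it again on the same variable is harmless because
 * free(NULL) does nothing. */
#define freeAndNull(pt) do { free(pt); (pt) = NULL; } while (0)
```

Routing every allocation through one pair of routines gives a single place to add error handling or leak checking later, and a pointer that is always NULL after freeing cannot be freed twice by accident.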
9 Name nitpicking
- var_name, varname, VARNAME, VarName, varName?
- We use varName.
- Abbreviations are typically 1st three letters of word, or 1st letter of each word.
- variable -> var
- local memory pool -> lmp
- Abbreviations and numbers are considered a word for capitalization purposes
- bigDnaBuffer
- read454Data
- Consistent variable naming conventions make it easier to remember what you called something.
10 Relational Databases
- Relational databases consist of tables, indices, and the Structured Query Language (SQL).
- Tables are much like tab-separated files

    chrom  start     end       name   strand  score
    chr22  14600000  14612345  ldlr   +       0.989
    chr21  18283999  18298577  vldlr  -       0.998

- Fields are simple - no lists or substructures.
- Can join tables based on a shared field. This is flexible, but only as fast as the index.
- Tables and joins are accessed a row at a time.
- The row is represented as an array of strings.
11 Converting A Row to Object
    struct exoFish *exoFishLoad(char **row)
    /* Load a exoFish from row fetched with select * from exoFish
     * from database.  Dispose of this with exoFishFree(). */
    {
    struct exoFish *ret;

    AllocVar(ret);
    ret->chrom = cloneString(row[0]);
    ret->chromStart = sqlUnsigned(row[1]);
    ret->chromEnd = sqlUnsigned(row[2]);
    ret->name = cloneString(row[3]);
    ret->score = sqlUnsigned(row[4]);
    return ret;
    }
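The loader above depends on library helpers that are not shown. A self-contained sketch of the same row-to-object conversion, using only the standard library, might look like the following. The helpers copyString and parseUnsigned are hypothetical stand-ins written for this example, and the structure layout (including the next field) is an assumption based on the five columns the loader reads.

```c
#include <stdlib.h>
#include <string.h>

struct exoFish
/* Assumed layout, inferred from the fields the loader fills in. */
    {
    struct exoFish *next;   /* Next in list (assumed). */
    char *chrom;            /* Chromosome name. */
    unsigned chromStart;    /* Start position. */
    unsigned chromEnd;      /* End position. */
    char *name;             /* Item name. */
    unsigned score;         /* Score. */
    };

static char *copyString(const char *s)
/* Hypothetical helper: return a heap copy of s, or NULL if out of memory. */
{
char *copy = malloc(strlen(s) + 1);
if (copy != NULL)
    strcpy(copy, s);
return copy;
}

static unsigned parseUnsigned(const char *s)
/* Hypothetical helper: convert a decimal string to unsigned.
 * No error checking; a real version should reject bad input. */
{
return (unsigned)strtoul(s, NULL, 10);
}

struct exoFish *exoFishLoad(char **row)
/* Load an exoFish from a row given as an array of five strings. */
{
struct exoFish *ret = calloc(1, sizeof(*ret));
if (ret == NULL)
    return NULL;
ret->chrom = copyString(row[0]);
ret->chromStart = parseUnsigned(row[1]);
ret->chromEnd = parseUnsigned(row[2]);
ret->name = copyString(row[3]);
ret->score = parseUnsigned(row[4]);
return ret;
}

void exoFishFree(struct exoFish **pEl)
/* Free an exoFish and set the pointer to NULL. */
{
struct exoFish *el = *pEl;
if (el != NULL)
    {
    free(el->chrom);
    free(el->name);
    free(el);
    *pEl = NULL;
    }
}
```

String columns are copied because the row array typically belongs to the database layer and is reused for the next row.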
12 Motivation for AutoSql
- Row to object code is tedious at best.
- Also have save object, free object code to write.
- SQL create statement needs to match C structure.
- Lack of lists without doing a join can seriously
impact performance and complicate schema.
13 AutoSql Data Declaration
    table exoFish
    "An evolutionarily conserved region (ecore) with Tetroadon"
        (
        string chrom;       "Human chromosome or FPC contig"
        uint chromStart;    "Start position in chromosome"
        uint chromEnd;      "End position in chromosome"
        string name;        "Ecore name in Genoscope database"
        uint score;         "Score from 0 to 1000"
        )
See autoSql.doc for more details.
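To show what the declaration stands for on the C side, here is a hand-written sketch of a structure that mirrors it field for field, plus one save routine. This is an illustration of the idea, not actual autoSql output. The next field and the routine name exoFishTabLine are assumptions made for this example.

```c
#include <stdio.h>

struct exoFish
/* An evolutionarily conserved region (ecore).  Mirrors the declaration. */
    {
    struct exoFish *next;   /* Next in list (assumed; not in the declaration). */
    char *chrom;            /* Human chromosome or FPC contig */
    unsigned chromStart;    /* Start position in chromosome */
    unsigned chromEnd;      /* End position in chromosome */
    char *name;             /* Ecore name in Genoscope database */
    unsigned score;         /* Score from 0 to 1000 */
    };

int exoFishTabLine(struct exoFish *el, char *buf, size_t bufSize)
/* Hypothetical save routine: format el as one tab-separated line in buf,
 * with fields in declaration order.  Returns the snprintf result, which
 * is the length the full line needs; a value >= bufSize means the line
 * was truncated. */
{
return snprintf(buf, bufSize, "%s\t%u\t%u\t%s\t%u\n",
    el->chrom, el->chromStart, el->chromEnd, el->name, el->score);
}
```

Keeping the structure, the load and save code, and the SQL table in step by hand is the tedious part that a single declaration is meant to remove.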
14 Checking out the Code
- Use CVS to check out kent/src. CVS helps many people work on the same code at once.
- Log onto hgwdev
- cvs checkout -d kent
- mkdir ~/bin/x86_64
- cp /usr/local/apache/cgi-bin/hg.conf ~/.hg.conf
- The source will be in kent/src under your home dir
- The binaries will end up in ~/bin/x86_64 mostly.
- You'll have read access to database (.hg.conf)
15 Organization of the code
- General purpose library modules
- src/lib/*.c src/inc/*.h
- Human genome project library modules
- src/hg/lib/*.c,*.sql,*.as src/hg/inc/*.h
- Alignment tools
- Pairwise genome alignment: src/hg/mouseStuff
- Multiple genome alignment: src/hg/ratStuff
- cDNA alignment: src/hg/psl, src/blat
- Database loaders: src/hg/makeDb
- Database build docs: src/hg/makeDb/doc
16 Project Work Flow
- Create subdirectory of src/hg for your project.
- Use newProg to create shell of a new command line utility.
- Check new programs or modifications of old programs into CVS when they seem to work.
- Participate in weekly pair reviews of checked in code.
- Use pushQ to notify QA group of new code or data destined for public web site.
- Don't be shy asking for advice or help from browser developers, especially from Jim, Hiram, and Kate, the senior browser developers.
17 Gene Code Demo
- Finding longest gene in hg18 (latest human)
- In SQL
- In C
- Finding longest intron in C
- From database
- From flat file
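The slide does not include the demo code, so the following is only a sketch of what the C side of these two searches could look like once gene records are in memory. The struct gene layout is hypothetical, invented for this example, and is not a table or structure from the code base. Positions are assumed to be half-open, with exons sorted and non-overlapping.

```c
#include <stddef.h>

struct gene
/* Hypothetical gene record for this sketch; positions are half-open. */
    {
    struct gene *next;        /* Next in list. */
    char *name;               /* Gene name. */
    unsigned txStart, txEnd;  /* Transcription start and end. */
    int exonCount;            /* Number of exons. */
    unsigned *exonStarts;     /* Exon start positions, sorted. */
    unsigned *exonEnds;       /* Exon end positions, sorted. */
    };

struct gene *longestGene(struct gene *list)
/* Return gene with largest txEnd - txStart, or NULL for an empty list.
 * On ties the earliest gene in the list wins. */
{
struct gene *gene, *longest = NULL;
for (gene = list; gene != NULL; gene = gene->next)
    {
    if (longest == NULL
        || gene->txEnd - gene->txStart > longest->txEnd - longest->txStart)
        longest = gene;
    }
return longest;
}

unsigned longestIntron(int exonCount, const unsigned *exonStarts,
    const unsigned *exonEnds)
/* Return size of longest intron, or 0 if there are fewer than two exons.
 * An intron is the gap between the end of one exon and the start of
 * the next. */
{
unsigned longest = 0;
int i;
for (i = 1; i < exonCount; ++i)
    {
    unsigned intronSize = exonStarts[i] - exonEnds[i-1];
    if (intronSize > longest)
        longest = intronSize;
    }
return longest;
}
```

The same two loops apply whether the records were loaded from database rows or parsed from a flat file; only the loading step differs.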
18 List and hash demo
- Find out how many SNPs are both in Affy250k and Illumina 300k arrays.
- snpArrayAffy250Nsp.rsId
- snpArrayIllumina300.name
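The counting step of this demo can be sketched as follows: put the ids from one array into a hash, then look up each id from the other. The tiny chained hash below, with the names idHash, idHashAdd, idHashContains, and countShared, is hypothetical and written for this example; it is not the library's hash module. It assumes ids are unique within each array.

```c
#include <stdlib.h>
#include <string.h>

struct idEl
/* One id in a hash bucket's list. */
    {
    struct idEl *next;      /* Next in bucket. */
    const char *id;         /* Id string, owned by the caller. */
    };

struct idHash
/* A set of id strings, hashed into singly-linked bucket lists. */
    {
    struct idEl **buckets;  /* Array of bucket list heads. */
    size_t bucketCount;     /* Number of buckets. */
    };

static size_t idHashFunc(const char *id)
/* Simple multiply-and-add string hash. */
{
size_t h = 5381;
while (*id != 0)
    h = h*33 + (unsigned char)*id++;
return h;
}

struct idHash *idHashNew(size_t bucketCount)
/* Make an empty hash.  Returns NULL if out of memory. */
{
struct idHash *hash = calloc(1, sizeof(*hash));
if (hash == NULL)
    return NULL;
if (bucketCount == 0)
    bucketCount = 1;
hash->buckets = calloc(bucketCount, sizeof(hash->buckets[0]));
if (hash->buckets == NULL)
    {
    free(hash);
    return NULL;
    }
hash->bucketCount = bucketCount;
return hash;
}

int idHashContains(struct idHash *hash, const char *id)
/* Return 1 if id is in hash, 0 otherwise. */
{
struct idEl *el = hash->buckets[idHashFunc(id) % hash->bucketCount];
for (; el != NULL; el = el->next)
    if (strcmp(el->id, id) == 0)
        return 1;
return 0;
}

int idHashAdd(struct idHash *hash, const char *id)
/* Add id at the head of its bucket.  Returns 0 if out of memory. */
{
struct idEl **bucket = &hash->buckets[idHashFunc(id) % hash->bucketCount];
struct idEl *el = malloc(sizeof(*el));
if (el == NULL)
    return 0;
el->id = id;
el->next = *bucket;
*bucket = el;
return 1;
}

void idHashFree(struct idHash **pHash)
/* Free hash (but not the id strings) and set pointer to NULL. */
{
struct idHash *hash = *pHash;
size_t i;
if (hash == NULL)
    return;
for (i = 0; i < hash->bucketCount; ++i)
    {
    struct idEl *el, *next;
    for (el = hash->buckets[i]; el != NULL; el = next)
        {
        next = el->next;
        free(el);
        }
    }
free(hash->buckets);
free(hash);
*pHash = NULL;
}

int countShared(const char **idsA, int countA, const char **idsB, int countB)
/* Count ids present in both arrays.  Returns -1 if out of memory. */
{
struct idHash *hash = idHashNew(countA > 0 ? 2*(size_t)countA : 1);
int i, shared = 0;
if (hash == NULL)
    return -1;
for (i = 0; i < countA; ++i)
    if (!idHashAdd(hash, idsA[i]))
        {
        idHashFree(&hash);
        return -1;
        }
for (i = 0; i < countB; ++i)
    if (idHashContains(hash, idsB[i]))
        ++shared;
idHashFree(&hash);
return shared;
}
```

Each lookup takes roughly constant time, so the whole comparison is linear in the two array sizes rather than quadratic as with nested loops over two lists.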
19 Working with range tree and hChromQuery
- Range tree - efficiently keeps a set of non-intersecting features that can be accessed by range.
- Illumina 300k SNPs that intersect most repeatMasker elements.
- snpArrayIllumina300
- chr1_rmsk, chr2_rmsk, etc.
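The idea of querying non-intersecting features by range can be sketched without a tree. The example below swaps in a simpler technique, a sorted array searched by binary search, which gives the same logarithmic lookup for a fixed set of features but does not support cheap insertion. It is not the range tree module. The names struct range, rangeFindOverlap, and countInRanges are hypothetical, and ranges are assumed to be half-open.

```c
#include <stddef.h>

struct range
/* A half-open interval [start, end) on one chromosome. */
    {
    unsigned start;   /* First position in range. */
    unsigned end;     /* One past last position in range. */
    };

const struct range *rangeFindOverlap(const struct range *ranges, int count,
    unsigned start, unsigned end)
/* Return first range overlapping [start, end), or NULL if none.
 * Requires ranges sorted by start and non-intersecting, which means
 * their ends are in sorted order too. */
{
int lo = 0, hi = count;
/* Binary search for the first range whose end is past start. */
while (lo < hi)
    {
    int mid = lo + (hi - lo)/2;
    if (ranges[mid].end <= start)
        lo = mid + 1;
    else
        hi = mid;
    }
if (lo < count && ranges[lo].start < end)
    return &ranges[lo];
return NULL;
}

int countInRanges(const struct range *ranges, int count,
    const unsigned *positions, int positionCount)
/* Count single-base positions (such as SNPs) that fall inside any range. */
{
int i, total = 0;
for (i = 0; i < positionCount; ++i)
    if (rangeFindOverlap(ranges, count, positions[i], positions[i] + 1) != NULL)
        ++total;
return total;
}
```

Because the features do not intersect, at most one candidate needs checking after the search, which is what makes the non-intersecting case so much simpler than the general one.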
20 Working with binKeeper
- binKeeper - keeps a collection of possibly intersecting features for range queries.
- chromKeeper - a binKeeper for each chromosome.
- Illumina 300k SNPs that intersect with known gene exons.
- snpArrayIllumina300
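When features may intersect each other, one way to keep range queries fast is to sort them into bins by position. The sketch below is a deliberately simplified, single-level version of that idea and should not be read as the binKeeper module's actual scheme. Each item goes in the bin of its start position, and a query widens its scan to the left by the size of the largest item seen so far. All names (simpleBins, binItem, and the functions) are hypothetical.

```c
#include <stdlib.h>

struct binItem
/* One feature stored in a bin. */
    {
    struct binItem *next;   /* Next in bin. */
    int start, end;         /* Half-open range of feature. */
    void *val;              /* Caller's data for the feature. */
    };

struct simpleBins
/* Fixed-size bins covering positions 0 to maxPos. */
    {
    int binSize;            /* Positions per bin. */
    int binCount;           /* Number of bins. */
    int maxPos;             /* Largest allowed end position. */
    int maxItemSize;        /* Size of largest item added so far. */
    struct binItem **bins;  /* Array of bin list heads. */
    };

struct simpleBins *simpleBinsNew(int maxPos, int binSize)
/* Make empty bins.  Returns NULL on bad arguments or out of memory. */
{
struct simpleBins *sb;
if (maxPos <= 0 || binSize <= 0)
    return NULL;
sb = calloc(1, sizeof(*sb));
if (sb == NULL)
    return NULL;
sb->binSize = binSize;
sb->maxPos = maxPos;
sb->binCount = (maxPos - 1)/binSize + 1;
sb->bins = calloc(sb->binCount, sizeof(sb->bins[0]));
if (sb->bins == NULL)
    {
    free(sb);
    return NULL;
    }
return sb;
}

int simpleBinsAdd(struct simpleBins *sb, int start, int end, void *val)
/* Add feature to the bin of its start.  Returns 0 if the range is
 * empty or out of bounds, or if out of memory. */
{
struct binItem *item;
if (start < 0 || end <= start || end > sb->maxPos)
    return 0;
item = malloc(sizeof(*item));
if (item == NULL)
    return 0;
item->start = start;
item->end = end;
item->val = val;
item->next = sb->bins[start/sb->binSize];
sb->bins[start/sb->binSize] = item;
if (end - start > sb->maxItemSize)
    sb->maxItemSize = end - start;
return 1;
}

int simpleBinsCountOverlaps(struct simpleBins *sb, int start, int end)
/* Count stored features that overlap [start, end). */
{
struct binItem *item;
int firstBin, lastBin, bin, count = 0;
if (start < 0)
    start = 0;
if (end > sb->maxPos)
    end = sb->maxPos;
if (start >= end)
    return 0;
/* An overlapping item must start after start - maxItemSize. */
firstBin = (start > sb->maxItemSize ? start - sb->maxItemSize : 0)/sb->binSize;
lastBin = (end - 1)/sb->binSize;
for (bin = firstBin; bin <= lastBin; ++bin)
    for (item = sb->bins[bin]; item != NULL; item = item->next)
        if (item->start < end && item->end > start)
            ++count;
return count;
}

void simpleBinsFree(struct simpleBins **pSb)
/* Free bins and items (but not the vals) and set pointer to NULL. */
{
struct simpleBins *sb = *pSb;
int bin;
if (sb == NULL)
    return;
for (bin = 0; bin < sb->binCount; ++bin)
    {
    struct binItem *item, *next;
    for (item = sb->bins[bin]; item != NULL; item = next)
        {
        next = item->next;
        free(item);
        }
    }
free(sb->bins);
free(sb);
*pSb = NULL;
}
```

For the SNP and exon question, one such structure per chromosome would hold the exons, and each SNP would be a one-base query. A single very large item makes every query scan further left, which is the main weakness of this simplified layout.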
21 Conclusion
- It's always safer on the lagging edge
- Consider redesigning system as a COBOL character-based application