Fs [27,28]. The most powerful generalization of this idea would be to turn motif finding into a feature selection problem in regression analysis: which set of features X (some functions of the motifs or CRMs) best explains the microarray data Y (e.g. expression scores)? This closely parallels the general problem in genetics, where Y represents the phenotype (mRNA expression) and X represents the genotype (promoter DNA elements). One would like to learn a model (a function f) such that f(X) best predicts Y. When "best" is measured by the average squared error under the distribution Pr(X, Y), the solution is the conditional expectation (also known as the regression function; see, e.g. [29]): f(x) = E(Y | X = x). REDUCE was the

BMC Bioinformatics 2007, 8(Suppl 6):S http://www.biomedcentral.com/1471-2105/8/S6/S

Two new de novo promoter prediction algorithms have emerged that further improve accuracy. One is ARTS [45], which is based on Support Vector Machines with multiple sophisticated sequence kernels. It claims to find about 35 true positives at a false positive rate of 1/1000, where the above-mentioned methods find only about half as many true positives (18). ARTS uses only downstream genic sequences as the negative set (non-promoters), and therefore it may produce more false positives in upstream non-genic regions. Furthermore, ARTS does not distinguish whether a promoter is CpG-island related or not, and it is not clear how ARTS performs on non-CpG-island related promoters. Another novel TSS prediction algorithm is CoreBoost [46], which is based on simple LogitBoosting with stumps. It has a false positive rate of 1/5000 at the same sensitivity level (Zhao, personal communication). CoreBoost uses both immediate upstream and downstream fragments as negative sets and trains a separate classifier for each before combining the two.
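The idea of training one boosted classifier per negative set and then combining them can be illustrated with a small sketch. This is not the CoreBoost implementation: it is a minimal two-class LogitBoost with regression stumps, written under several assumptions. The two features (`cpg_ratio`, `tata_score`), the toy feature values, the weight clamp and response clipping constants, and the rule for combining the two classifiers (taking the minimum of the two probabilities) are all hypothetical; the text does not specify how the two classifiers are combined.

```python
import math

def fit_stump(X, z, w):
    """Weighted least-squares regression stump -> (feature, threshold, left, right)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            sw = [0.0, 0.0]   # total weight on the left / right side
            swz = [0.0, 0.0]  # weighted sum of responses on each side
            for row, zi, wi in zip(X, z, w):
                side = 0 if row[j] <= t else 1
                sw[side] += wi
                swz[side] += wi * zi
            means = (swz[0] / sw[0], swz[1] / sw[1])
            err = sum(wi * (zi - means[0 if row[j] <= t else 1]) ** 2
                      for row, zi, wi in zip(X, z, w))
            if best is None or err < best[0]:
                best = (err, j, t, means[0], means[1])
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]

def stump_value(stump, row):
    j, t, left, right = stump
    return left if row[j] <= t else right

def logitboost(X, y, rounds=10):
    """Two-class LogitBoost (labels 0/1) with stumps; returns the fitted stumps."""
    F = [0.0] * len(X)
    stumps = []
    for _ in range(rounds):
        p = [1.0 / (1.0 + math.exp(-2.0 * f)) for f in F]
        w = [max(pi * (1.0 - pi), 1e-6) for pi in p]   # clamp value is an assumption
        z = [max(-4.0, min(4.0, (yi - pi) / wi))       # clipping range is an assumption
             for yi, pi, wi in zip(y, p, w)]
        stump = fit_stump(X, z, w)
        stumps.append(stump)
        F = [f + 0.5 * stump_value(stump, row) for f, row in zip(F, X)]
    return stumps

def predict_proba(stumps, row):
    F = sum(0.5 * stump_value(s, row) for s in stumps)
    return 1.0 / (1.0 + math.exp(-2.0 * F))

def train(positives, negatives):
    X = positives + negatives
    y = [1] * len(positives) + [0] * len(negatives)
    return logitboost(X, y)

# Toy, made-up feature vectors: [cpg_ratio, tata_score] (hypothetical features).
promoters = [[0.90, 0.80], [0.80, 0.90], [0.85, 0.70]]
upstream_neg = [[0.20, 0.75], [0.30, 0.80], [0.10, 0.90]]
downstream_neg = [[0.80, 0.10], [0.90, 0.20], [0.85, 0.30]]

vs_upstream = train(promoters, upstream_neg)
vs_downstream = train(promoters, downstream_neg)

def combined_score(row):
    """Assumed combination rule: require support from both classifiers (min)."""
    return min(predict_proba(vs_upstream, row), predict_proba(vs_downstream, row))
```

With this toy data, a candidate scores highly only if it looks like a promoter against both the upstream and the downstream background, which is the motivation for using two negative sets. Other combination rules (for example, averaging the two scores) would be equally easy to substitute.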
CoreBoost is trained on 300 bp fragments (-250, +50), so it is more localized than ARTS, which is trained on 2 kb fragments (-1 kb, +1 kb). The ideal application of TSS prediction algorithms is to combine them with gene prediction algorithms [21] and/or with ChIP-chip PIC mapping data [14].

Future direction: epigenetics and chromatin states

Although much progress has been made in promoter prediction and cis-regulatory motif discovery, false positives remain the main problem when scanning the whole genome. Fundamentally, this is because information about chromatin structure is still missing from all our models. Protein-DNA binding specificity is determined partly by the energetics and partly by "entropy", which depends on how much of the genome is accessible to the DNA-binding protein [47]. Without knowing which regions of chromatin are open or closed (and to what degree), researchers have to assume that the whole genome is accessible for binding, which is obviously wrong and leads to more false positives (and to false negatives, because of the extra noise). This is clearly shown by recent genome-wide ChIP-chip data as well as DNase I hypersensitivity mapping data. Higher-order prediction algorithms are needed that can predict chromatin states from, perhaps, genome-wide epigenetic measurements, CpG islands and repeat characteristics in addition to genomic sequences. Fortunately, such data are rapidly being generated [48-54], along with the corresponding analyses.
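The effect of chromatin accessibility on false positives can be illustrated with a small sketch that keeps only those predicted binding sites that fall inside regions assumed to be open. This is an illustration under stated assumptions, not a method from the cited work: the function names (`build_index`, `is_accessible`, `filter_hits`), the half-open interval convention, the requirement that open regions do not overlap, and all coordinates and scores are hypothetical.

```python
from bisect import bisect_right

def build_index(open_regions):
    """Group half-open (start, end) open-chromatin intervals by chromosome.

    Assumes intervals on the same chromosome do not overlap.
    Returns {chrom: (sorted_starts, matching_ends)}.
    """
    grouped = {}
    for chrom, start, end in open_regions:
        grouped.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, regions in grouped.items():
        regions.sort()
        index[chrom] = ([s for s, _ in regions], [e for _, e in regions])
    return index

def is_accessible(index, chrom, pos):
    """True if pos lies inside an open interval on chrom (start <= pos < end)."""
    starts, ends = index.get(chrom, ([], []))
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < ends[i]

def filter_hits(hits, index):
    """Keep (chrom, pos, score) motif hits that fall in open chromatin."""
    return [h for h in hits if is_accessible(index, h[0], h[1])]

# Toy, made-up open regions and genome-wide motif hits.
open_regions = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 50, 80)]
hits = [
    ("chr1", 150, 8.2),
    ("chr1", 300, 9.1),   # high score, but in closed chromatin
    ("chr1", 649, 7.5),
    ("chr1", 650, 7.9),   # just outside the half-open interval
    ("chr2", 10, 8.8),
    ("chr3", 5, 9.0),     # chromosome with no accessibility data
]

kept = filter_hits(hits, build_index(open_regions))
```

A hard filter like this treats accessibility as binary. Since the text notes that the degree of openness matters, a graded weighting of scores by accessibility would be a natural alternative.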