segmentation fault (core dumped)

ComputerCarl

vielen Dank für die schnelle Antwort. Dein Tipp ist sehr hilfreich, denn er zeigt mir wohl den Fehler an. Ich muss ihn nur noch genau finden und verstehen.

terminate called after throuwing an instance of 'std::logic_error'
what(): basic_string::_S_construct null not valid;

jetzt hab ich schon eine Weile rumprobiert. Heisst das: [string a = "";] ist keine gültige Belegung? Der Fehler wird doch wohl woanders liegen.

SeppJ

Zeig Code!

Heisst das: [string a = "";] ist keine gültige Belegung?

Nein, das ist absolut in Ordnung. Die Fehlermeldung klingt eher so, als würde ein String mit einem Nullzeiger initialisiert.

ComputerCarl

Da das Problem ausschließlich bei "größen" Daten auftritt, könnte das Problem mit den Referenzen damit in Verbindung stehen? Also dass sie irgendwie überschrieben oder ungültig werden? Also mit kleinen und großen Datenmengen meine ich klein 1,01MB und groß 11,42MB.

Code ist ziemlich lang. ich weiß nicht ob sich das wer antuen will

#include<iostream>
#include<fstream>
#include<vector>
#include<limits>
#include<sstream>
#include<random>
#include<map>
#include<math.h>

using namespace std;

// Gipps.exe Promotor.txt Motiflength Init It Burn_In pseudo_counts order_of_background
// pos3 hat wohl beim ueberlauf einfluss auf start_pos warum auch immer
class OOPS {

public :
        vector<string>*         data;                 // Data-Sequences
        int                     max_data_length;      // max-length of the Sequences (for faster calculations)

        char                    alphabet[25];         // Alphabet - DNA/RNA/Protein (all possible)
        int                     alphabet_size;

        double                  absolut_min;
        double                  motif_prob;
        double                  strain_prob;

        int                     motif_length;
        double                  foreground   [10000];   // <= Bei PWM motif_length < 2500  
        int                     order_bg;               // Background-Parameters (hMM-D)
        double                  background[1000000];    // ca. hMM-8

        vector<unsigned short>  start_pos;
        vector<char>            pos_strain;
        vector<double>          log_likelihood;

private :

        /* 
           procedure to calculate the sum of logarithmic values presented in a vector

           (1) vector<double> &summands  -> vector of logarithmic values
           (2) double find_max           -> maximal value in summands

           - it is necessary to identify the max-value of the vector-components/summands
           - because of speed limitations max-value is calculated outside this function
        */        
        double log_of_sums(vector<double> &summands, double find_max, unsigned short summands_size) {
               double res = 0;
               for(unsigned short i = 0; i < summands_size; i++)
               res += exp(summands[i] - find_max);
               res = log(res) + find_max;
               return res;
        }

        /*
           - simple method for hashing a character to the entry of its probability vector
           - even though the data-strings could be stored as a integer-vector and therefore
             makes this hashing unpractical, test have shown this method is faster
        */
        int hash(char a) {
            for(unsigned short i = 0; i < this->alphabet_size; i++) {
                    if(a == this->alphabet[i])
                    return i;
            }       
        }

        /*
           - calculates the gamma-function to a value tmp and returns its logarithmic value
           - only capable for integer but its a good approximation for double
        */
        double log_gamma(double tmp) {
               int x = (int)tmp;
               double num=0;
               while(x>1) num+=log(x--);
               return num;
        }

        /*
           - samples parameters for the initial forground
           - the parameters are sampled from a Dirichlet-Distribution which can
             be derived from the samplings of a Gamma-Distribution
           - values under 1 (namely 0.3) makes the sampling results very variable
             this provides a less uniform character of the parameters and there
             makes better inital parameters
        */
        void init_foreground(default_random_engine &generator) {
               gamma_distribution<double> distribution(0.3, 1.0);
               for(unsigned short i = 0; i < this->motif_length; i++) {
                       double sum = 0;
                       for(unsigned short k = 0; k < this->alphabet_size; k++) {
                               this->foreground[i * this->alphabet_size + k] = distribution(generator);
                               sum += this->foreground[i * this->alphabet_size + k];
                       }
                       for(unsigned short k = 0; k < this->alphabet_size; k++) {
                       this->foreground[i * this->alphabet_size + k] /= sum;
                       this->foreground[i * this->alphabet_size + k] = log(this->foreground[i * this->alphabet_size + k]);
                       }
               }
        }

        /*
           - calculates the background-modell for the given data-set
           - 
        */
        void calc_background() {
                map<string, double> hash_table;
                map<int, double> hash_prob;

                vector<string> pseudo_count_keys(1, "");
                // Hash-Table with Data-Counts
                for(int i = 0; i < (*this->data).size(); i++) {
                        for(int j = 1; j <= (*this->data)[i].size(); j++) {
                                if(j <= this->order_bg)
                                hash_table[(*this->data)[i].substr(0, j)]++;
                                else
                                hash_table[(*this->data)[i].substr(j-this->order_bg, this->order_bg)]++;
                        }
                }
                // Hash-Table with normalized Probabilities
                map<string, double>::iterator it;
                for (it=hash_table.begin(); it!=hash_table.end(); ++it) {
                    string prefix = it->first.substr(0, it->first.size() - 1);

                    double sum = 0;
                    for(int a = 0; a < this->alphabet_size; a++)
                    sum += hash_table[prefix+this->alphabet[a]];

                    for(int a = 0; a < this->alphabet_size; a++) {
                            int key = 0;
                            if(prefix.size() > 0) {
                                    for(int t = 0; t < prefix.size(); t++) {
                                    key += (this->hash(prefix[prefix.size()-t-1])+1)*(int)pow(10, t);
                                    }
                                    key *= 10;
                            }
                            key += this->hash(this->alphabet[a])+1;
                            hash_prob[key] = log(hash_table[prefix+this->alphabet[a]] / sum);
                    }
                }
                vector<double> background;
                map<int, double>::iterator it2;
                for(int i = 0; i <= this->order_bg; i++) {
                        for (it2=hash_prob.begin(); it2 !=hash_prob.end(); ++it2) {
                            background.push_back(it2->second);
                        }
                }
                for(int i = 0; i < background.size(); i++) {
                        this->background[i] = background[i];
                }
        }

        void sequence_prob(unsigned short i, int temp, vector<double> *result_for, vector<double> *result_rev, double *tmp_max, double *tmp_max2) {               
               for(unsigned short u_i = 0; u_i < (*this->data)[i].size() - this->motif_length; u_i++) {
                  (*result_for)[u_i] = 0;
                  (*result_rev)[u_i] = 0;
                  double test = 0;
                  int word = 0;
                  // left-flanking-region
                  for(unsigned short l = 1; l <= u_i; l++) {
                       if(l <= this->order_bg) {
                             word    *= this->alphabet_size;
                             word    += this->hash((*this->data)[i][l-1]);
                             (*result_for)[u_i] += this->background[word];
                             (*result_rev)[u_i] += this->background[word];
                       }
                       else {  
                             word    %= temp;
                             word    *= this->alphabet_size;
                             word    += this->hash((*this->data)[i][l-1]);
                             (*result_for)[u_i] += this->background[word];
                             (*result_rev)[u_i] += this->background[word];
                       }
                  }
                  // motif and reverse motif
                  for(unsigned short l = u_i; l < (u_i + this->motif_length); l++) {
                  (*result_for)[u_i] += this->foreground[(l - u_i) * this->alphabet_size + this->hash((*this->data)[i][l])];
                  (*result_rev)[u_i] += this->foreground[((u_i + this->motif_length)-1-l) * this->alphabet_size +this->alphabet_size - 1 - this->hash((*this->data)[i][l])];
                  }
                  // right-flanking region
                  word = 0;
                  for(unsigned short l = u_i + this->motif_length+1; l <= (*this->data)[i].size(); l++) {                       
                       if(l - (u_i + this->motif_length) <= this->order_bg) {
                            word    *= this->alphabet_size;
                            word    += this->hash((*this->data)[i][l-1]);                            
                            (*result_for)[u_i] += this->background[word];
                            (*result_rev)[u_i] += this->background[word];
                       }
                       else {
                            word    %= temp;
                            word    *= this->alphabet_size;
                            word    += this->hash((*this->data)[i][l-1]);
                            (*result_for)[u_i] += this->background[word];
                            (*result_rev)[u_i] += this->background[word];
                       }
                  }
                  // start-position-prior
                  (*result_for)[u_i] += log((double)1./((*this->data)[i].size()-this->motif_length));
                  (*result_rev)[u_i] += log((double)1./((*this->data)[i].size()-this->motif_length));

                  if((*tmp_max) < (*result_for)[u_i])
                  (*tmp_max) = (*result_for)[u_i];
                  if((*tmp_max2) < (*result_rev)[u_i])
                  (*tmp_max2)= (*result_rev)[u_i];
               }
        }

        void calculate_all_gammas(vector<double> *gamma_for, vector<double> *gamma_rev, vector<double> *gamma_zer, int temp, vector<double> *joint_dist) {                              
             vector<double> log_seq_prob_for((this->max_data_length - this->motif_length), 0);
             vector<double> log_seq_prob_rev((this->max_data_length - this->motif_length), 0);
             double tmp_max_for, tmp_max_rev, divisor_for, divisor_rev;

             for(unsigned short i = 0; i < (*this->data).size(); i++) { 
                          tmp_max_for = this->absolut_min;
                          tmp_max_rev = this->absolut_min;
                          this->sequence_prob(i, temp, &log_seq_prob_for, &log_seq_prob_rev, &tmp_max_for, &tmp_max_rev);                     
                          divisor_for  = this->log_of_sums(log_seq_prob_for, tmp_max_for, log_seq_prob_for.size());
                          (*joint_dist)[i*3+1] = divisor_for + this->motif_prob + this->strain_prob;
                          divisor_rev  = this->log_of_sums(log_seq_prob_rev, tmp_max_rev, log_seq_prob_rev.size());                 
                          (*joint_dist)[i*3+2] = divisor_rev + this->motif_prob + log(1-exp(this->strain_prob));
                          for(unsigned short u_i = 0; u_i < ((*this->data)[i].size()-this->motif_length); u_i++) {
                                  (*gamma_for)[i*(this->max_data_length-this->motif_length)+u_i] = log_seq_prob_for[u_i] - divisor_for;
                                  (*gamma_rev)[i*(this->max_data_length-this->motif_length)+u_i] = log_seq_prob_rev[u_i] - divisor_rev;
                          }
                          // ZOOPS!!!
                          (*gamma_zer)[i] = 0;                
                          int word = 0;
                          for(unsigned short l = 1; l <= (*this->data)[i].size(); l++) {
                                  if(l <= this->order_bg) {
                                  word    *= this->alphabet_size;
                                  word    += this->hash((*this->data)[i][l-1]);
                                  (*gamma_zer)[i]  += this->background[word];
                                  }
                                  else {  
                                  word    %= temp;
                                  word    *= this->alphabet_size;
                                  word    += this->hash((*this->data)[i][l-1]);
                                  (*gamma_zer)[i]  += this->background[word];
                                  }
                          }
                          (*joint_dist)[i*3+0] = (*gamma_zer)[i] + log(1 - exp(this->motif_prob));
                          // ZOOPS!!!
             }             
        }

        void sample_from_gammas (vector<double> *gamma_for, vector<double> *gamma_rev, vector<double> *gamma_zer, vector<double> *joint_dist) {

             vector<double> tmp_vec(3, 0);
             double sum, r, m, tmp_max, res, res2;

             for(unsigned short i = 0; i < (*this->data).size(); i++) {
                     m = log((double) rand() / (RAND_MAX));    // 0 <= r < 1
                     tmp_max    = (*joint_dist)[i*3+0];
                     tmp_vec[0] = (*joint_dist)[i*3+0];
                     for(int s = 1; s < 3; s++) {
                           if((*joint_dist)[i*3+s] > tmp_max)
                           tmp_max    = (*joint_dist)[i*3+s];
                           tmp_vec[s] = (*joint_dist)[i*3+s];
                     }
                     sum = this->log_of_sums(tmp_vec, tmp_max, tmp_vec.size());
                     // ZOOPS
                     if(m < (*joint_dist)[i*3+0] - sum ) {
                          this->start_pos[i] = (*data)[i].size()-1;
                          this->pos_strain[i]= 'z';
                     }
                     else {
                          // OOPS
                          if(exp(m) < exp((*joint_dist)[i*3+0]-sum) + exp((*joint_dist)[i*3+1]-sum)) {
                               r = log((double) rand() / (RAND_MAX));                // 0 <= r < 1
                               tmp_max = (*gamma_for)[i*(this->max_data_length-this->motif_length)+0];
                               res = (*gamma_for)[i*(this->max_data_length-this->motif_length)+0];
                               for(unsigned short u_i = 0; u_i < ((*this->data)[i].size()-this->motif_length); u_i++ ) {
                                  if(u_i > 0) {                                       
                                       if(r > res) {
                                            tmp_max = res;
                                            if(tmp_max < (*gamma_for)[i*(this->max_data_length-this->motif_length)+u_i])
                                               tmp_max = (*gamma_for)[i*(this->max_data_length-this->motif_length)+u_i];
                                           res  = exp(res - tmp_max) + exp((*gamma_for)[i*(this->max_data_length-this->motif_length)+u_i] - tmp_max);
                                           res = log(res) + tmp_max;       
                                       }
                                       else {
                                            this->start_pos[i] = u_i-1;
                                            this->pos_strain[i]= 'f';                                
                                            u_i = (*this->data)[i].size() - this->motif_length;
                                       }
                                  }
                               }
                          }
                          // Rev-OOPS
                          else {
                               r = log((double) rand() / (RAND_MAX));                // 0 <= r < 1
                               tmp_max = (*gamma_rev)[i*(this->max_data_length-this->motif_length)+0];
                               res = (*gamma_rev)[i*(this->max_data_length-this->motif_length)+0];
                               for(unsigned short u_i = 0; u_i < ((*this->data)[i].size()-this->motif_length); u_i++ ) {
                                  if(u_i > 0) {
                                       if(r > res) {
                                            tmp_max = res;
                                            if(tmp_max < (*gamma_rev)[i*(this->max_data_length-this->motif_length)+u_i])
                                               tmp_max = (*gamma_rev)[i*(this->max_data_length-this->motif_length)+u_i];
                                           res  = exp(res - tmp_max) + exp((*gamma_rev)[i*(this->max_data_length-this->motif_length)+u_i] - tmp_max);
                                           res = log(res) + tmp_max;       
                                       }
                                       else {
                                            this->start_pos[i] = u_i-1;
                                            this->pos_strain[i]= 'r';
                                            u_i = (*this->data)[i].size() - this->motif_length;
                                       }
                                  }
                               }
                          }                        
                     } // end-else
             }
        }

        void sample_foreground(double pseudocounts, vector<double> *tmp_para, default_random_engine &generator) {
             // Count all Counts and Pseudocounts
             vector<double> counts_all(this->motif_length*this->alphabet_size, pseudocounts);
             for(unsigned short i = 0; i < (*this->data).size(); i++)
                for(unsigned short j = 0; j < this->motif_length; j++)
                   if(this->start_pos[i] < (*this->data)[i].size() - this->motif_length && this->start_pos[i] >= 0) {
                          if(this->pos_strain[i] == 'f')
                          { counts_all[j*this->alphabet_size+this->hash((*this->data)[i][this->start_pos[i] + j])]++; }
                          if(this->pos_strain[i] == 'r')
                          { counts_all[(this->motif_length-1-j)*this->alphabet_size+this->alphabet_size-1-this->hash((*this->data)[i][this->start_pos[i] + j])]++; }
                   }
             // Sample Parameters
             for(unsigned short l = 0; l < this->motif_length; l++) {
                double sum = 0;
                for(unsigned short a = 0; a <this->alphabet_size; a++) {
                        gamma_distribution<double> distribution(counts_all[l*this->alphabet_size+a],1.0);
                        (*tmp_para)[a]  = distribution(generator);
                        sum         += (*tmp_para)[a];
                }
                for(unsigned short a = 0; a < (*tmp_para).size(); a++)
                this->foreground[l * this->alphabet_size + a] = log((*tmp_para)[a] / sum);
             }
        }

        void calc_start_pos() {
           double log_likelihood = 0;
           int    temp = (int)pow(this->alphabet_size, this->order_bg-1);
           vector<double> gamma_for ((*this->data).size()*(this->max_data_length-this->motif_length),0);
           vector<double> gamma_rev ((*this->data).size()*(this->max_data_length-this->motif_length),0);
           vector<double> gamma_zer ((*this->data).size(),0);
           vector<double> joint_dist((*this->data).size() * 3, 0);

           calculate_all_gammas(&gamma_for, &gamma_rev, &gamma_zer, temp, &joint_dist);
           for(int i = 0; i < (*this->data).size(); i++) {
                   double max_value = this->absolut_min;
                   for(int u_i = 0; u_i < (*this->data)[i].size() - this->motif_length; u_i++) {   
                      if(joint_dist[i*3+1] >= joint_dist[i*3+2] && joint_dist[i*3+1] >= joint_dist[i*3+0])
                           if(gamma_for[i*(this->max_data_length-this->motif_length)+u_i] > max_value) {
                              max_value = gamma_for[i*(this->max_data_length-this->motif_length)+u_i];
                              this->start_pos[i] = u_i;
                           }

                      if(joint_dist[i*3+2] >= joint_dist[i*3+1] && joint_dist[i*3+2] >= joint_dist[i*3+0])
                           if(gamma_rev[i*(this->max_data_length-this->motif_length)+u_i] > max_value) {
                              max_value = gamma_rev[i*(this->max_data_length-this->motif_length)+u_i];
                              this->start_pos[i] = u_i;
                           }       
                   }
                   if(joint_dist[i*3+0] >= joint_dist[i*3+2] && joint_dist[i*3+0] >= joint_dist[i*3+1])
                   if(gamma_zer[i] > max_value) {
                              max_value = gamma_rev[i];
                              this->start_pos[i] = (*this->data)[i].size();
                   }
           }
       }

      double calc_bayes_factor (vector<unsigned short>& sample, vector<char>& orientation,   double pseudocounts, int iteration, int burn_in ) {

           vector<double> occurences(this->motif_length * this->alphabet_size, pseudocounts);
           vector<double> sum(this->motif_length, 0);
           for(int i = 0; i < this->start_pos.size(); i++) {
                   if(this->start_pos[i] < (*this->data)[i].size() - this->motif_length + 1) {
                   if(this->pos_strain[i] == 'f')
                   for(int w = 0; w < this->motif_length; w++)
                   occurences[w * this->alphabet_size + this->hash((*this->data)[i][this->start_pos[i] + w])]++;
                   if(this->pos_strain[i] == 'r')
                   for(int w = 0; w < this->motif_length; w++)
                   occurences[(this->motif_length-1-w) * this->alphabet_size + this->alphabet_size-1-this->hash((*this->data)[i][this->start_pos[i] + w])]++;   
                   }
           }
           for(int w = 0; w < this->motif_length; w++) {
                   for(int a = 0; a < this->alphabet_size; a++)
                   sum[w] += occurences[w * this->alphabet_size + a];
                   for(int a = 0; a < this->alphabet_size; a++)
                   this->foreground[w * this->alphabet_size + a] = log(occurences[w * this->alphabet_size + a] / sum[w]);
           }

           vector<double> gamma_for ((*this->data).size()*(this->max_data_length-this->motif_length),0);
           vector<double> gamma_rev ((*this->data).size()*(this->max_data_length-this->motif_length),0);
           vector<double> gamma_zer ((*this->data).size(),0);
           vector<double> joint_dist((*this->data).size() * 3, 0);
           int    temp = (int)pow(this->alphabet_size, this->order_bg-1);
           this->calculate_all_gammas(&gamma_for, &gamma_rev, &gamma_zer, temp, &joint_dist);

           // Likelihood
           double log_likelihood = 0;
           for(unsigned short i = 0; i < (*this->data).size(); i++)
           log_likelihood += joint_dist[i*3+0] + joint_dist[i*3+1] + joint_dist[i*3+2];
           // Prior
           double prior = 0;
           for(unsigned short l = 0; l < this->motif_length; l++) {
                        prior += this->log_gamma(this->alphabet_size*pseudocounts);
                        prior -= this->alphabet_size*this->log_gamma(pseudocounts);
                        for(unsigned short a = 0; a < this->alphabet_size; a++)
                        prior += (pseudocounts - 1)*this->foreground[l * this->alphabet_size + a];
           }
           // Posterior -> Rao-Blackwell-approxomation
           vector<double> posterior_all;
           double         max_value = this->absolut_min;
           double         posterior;

           for(int it = iteration - burn_in; it < iteration; it++) {
                   vector<double> occurences2(this->motif_length * this->alphabet_size, pseudocounts);
                   vector<double> sum2(this->motif_length, 0);
                   for(int i = 0; i < this->start_pos.size(); i++) {
                           if(sample[i] < (*this->data)[i].size() - this->motif_length + 1) {
                           if(orientation[it*(*this->data).size() + i] == 'f')
                           for(int w = 0; w < this->motif_length; w++)
                           occurences2[w * this->alphabet_size + this->hash((*this->data)[i][sample[it*(*this->data).size() + i] + w])]++;
                           if(orientation[it*(*this->data).size() + i] == 'r')
                           for(int w = 0; w < this->motif_length; w++)
                           occurences2[(this->motif_length-1-w) * this->alphabet_size + this->alphabet_size-1-this->hash((*this->data)[i][sample[it*(*this->data).size() + i] + w])]++;   
                           }
                   }
                   for(int w = 0; w < this->motif_length; w++)
                   for(int a = 0; a < this->alphabet_size; a++)
                   sum2[w] += occurences2[w * this->alphabet_size + a];
                   double posterior = 0;
                   for(int w = 0; w < this->motif_length; w++) {
                           double tmp_posterior = 0; 
                           tmp_posterior += this->log_gamma(sum2[w]);
                           for(unsigned short a = 0; a < this->alphabet_size; a++)
                           tmp_posterior -= this->log_gamma(occurences2[w * this->alphabet_size + a]);
                           for(unsigned short a = 0; a < this->alphabet_size; a++)
                           tmp_posterior += (occurences2[w * this->alphabet_size + a]-1)*this->foreground[w * this->alphabet_size + a];
                           if(tmp_posterior <= 0)
                           posterior += tmp_posterior;
                   }
                   posterior_all.push_back(posterior);
                   if(max_value < posterior)
                   max_value = posterior;
           }
           posterior = this->log_of_sums(posterior_all, max_value, posterior_all.size());
           posterior -= log(iteration - burn_in);

           // Chib-approximation
           return log_likelihood + prior - posterior;
      }

      void make_shift(double pseudocounts, vector<double> *tmp_para, default_random_engine& generator, int t, int burn_in, double *log_likelihood) {

           vector<unsigned short> real_start = this->start_pos;
           vector<char>           real_strain= this->pos_strain;
           double                 real_foreground[10000];
           for(unsigned short i = 0; i < this->motif_length; i++)
           for(unsigned short k = 0; k < this->alphabet_size; k++)
           real_foreground[i * this->alphabet_size + k] = this->foreground[i * this->alphabet_size + k];

           double max;

           vector<double> gamma_for ((*this->data).size()*(this->max_data_length-this->motif_length),0);
           vector<double> gamma_rev ((*this->data).size()*(this->max_data_length-this->motif_length),0);
           vector<double> gamma_zer ((*this->data).size(),0);
           vector<double> joint_dist((*this->data).size() * 3, 0);
           int    temp = (int)pow(this->alphabet_size, this->order_bg-1);

           for(int s = 0; s < 5; s++) {
                   // shift right
                   max = 0;
                   vector<char> tmp = this->pos_strain;
                   for(int i = 0; i < (*this->data).size(); i++) {
                           if(this->pos_strain[i] == 'f') { this->start_pos[i] += s; }
                           if(this->pos_strain[i] == 'r') { this->start_pos[i] -= s; }
                           if(this->start_pos[i] < 0 || this->start_pos[i] >= (*this->data)[i].size() - this->motif_length)
                           { this->pos_strain[i] = 'z'; this->start_pos[i] = (*this->data)[i].size()-1; }
                   }
                   this->sample_foreground(pseudocounts, tmp_para, generator);
                   this->calculate_all_gammas(&gamma_for, &gamma_rev, &gamma_zer, temp, &joint_dist);
                   for(unsigned short i = 0; i < (*this->data).size(); i++)
                   max += joint_dist[i*3+0] + joint_dist[i*3+1] + joint_dist[i*3+2];
                   if(max > *log_likelihood) {
                         *log_likelihood = max;
                         for(unsigned short i = 0; i < this->motif_length; i++)
                         for(unsigned short k = 0; k < this->alphabet_size; k++)
                         real_foreground[i * this->alphabet_size + k] = this->foreground[i * this->alphabet_size + k];
                         real_start = this->start_pos;
                         real_strain= this->pos_strain;
                   }
                   // shift left
                   max = 0;
                   this->pos_strain = tmp;
                   for(int i = 0; i < (*this->data).size(); i++) {
                           if(this->pos_strain[i] == 'f') { this->start_pos[i] -= 2*s; }
                           if(this->pos_strain[i] == 'r') { this->start_pos[i] += 2*s; }
                           if(this->start_pos[i] < 0 || this->start_pos[i] >= (*this->data)[i].size() - this->motif_length)
                           { this->pos_strain[i] = 'z'; this->start_pos[i] = (*this->data)[i].size()-1; }
                   }
                   this->sample_foreground(pseudocounts, tmp_para, generator);
                   this->calculate_all_gammas(&gamma_for, &gamma_rev, &gamma_zer, temp, &joint_dist);
                   for(unsigned short i = 0; i < (*this->data).size(); i++)
                   max += joint_dist[i*3+0] + joint_dist[i*3+1] + joint_dist[i*3+2];
                   if(max > *log_likelihood) {
                         *log_likelihood = max;
                         for(unsigned short i = 0; i < this->motif_length; i++)
                         for(unsigned short k = 0; k < this->alphabet_size; k++)
                         real_foreground[i * this->alphabet_size + k] = this->foreground[i * this->alphabet_size + k];
                         real_start = this->start_pos;
                         real_strain= this->pos_strain;
                   }
                   // return shift
                   for(int i = 0; i < (*this->data).size(); i++) {
                           if(this->pos_strain[i] == 'f') { this->start_pos[i] += s; }
                           if(this->pos_strain[i] == 'r') { this->start_pos[i] -= s; }
                   }
           }
           this->start_pos = real_start;
           this->pos_strain= real_strain;

           for(unsigned short i = 0; i < this->motif_length; i++)
           for(unsigned short k = 0; k < this->alphabet_size; k++)
           this->foreground[i * this->alphabet_size + k] = real_foreground[i * this->alphabet_size + k];
      }

public :

       OOPS(vector<string> *data, int motif_length, vector<string> alphabet1, int bg_order, default_random_engine &generator) {
               this->alphabet_size = alphabet1.size();
               for(int i = 0; i < this->alphabet_size; i++)
               this->alphabet[i]    = alphabet1[i][0];

               this->absolut_min = -1 * numeric_limits<double>::max();
               this->motif_prob  = log(0.9);
               this->strain_prob = log(0.5);

               this->data        = data;
               this->motif_length= motif_length;
               this->order_bg    = bg_order; 

               this->max_data_length = 0;
               for(int i = 0; i < (*this->data).size(); i++)
               if((*this->data)[i].size() > max_data_length)
               this->max_data_length = (*this->data)[i].size();

               vector<unsigned short> pos((*this->data).size(), 0);
               this->start_pos = pos;
               vector<char> strain((*this->data).size(), 'f');
               this->pos_strain= strain;

               this->init_foreground(generator);
               this->calc_background();
       };

       double gipps(int iteration, int id, int burn_in, double pseudocounts, default_random_engine& generator, double& max) {
           this->init_foreground(generator);
           vector<double> log_likelihood(iteration, 0);
           vector<double> tmp_para(this->alphabet_size, 0);
           vector<double> gamma_for ((*this->data).size()*(this->max_data_length-this->motif_length),0);
           vector<double> gamma_rev ((*this->data).size()*(this->max_data_length-this->motif_length),0);
           vector<double> gamma_zer ((*this->data).size(),0);
           vector<double> joint_dist((*this->data).size() * 3, 0);
           vector<unsigned short>  sample(iteration * (*this->data).size(), 0);
           vector<char>            orientation(iteration * (*this->data).size(), 'z');
           int    temp = (int)pow(this->alphabet_size, this->order_bg-1);
           for(unsigned short t = 0; t < iteration; t++) {
                   cout << t << "\t";
                   this->calculate_all_gammas(&gamma_for, &gamma_rev, &gamma_zer, temp, &joint_dist);
                   this->sample_from_gammas  (&gamma_for, &gamma_rev, &gamma_zer, &joint_dist);
                   this->sample_foreground   (pseudocounts, &tmp_para, generator);                   
                   for(unsigned short i = 0; i < (*this->data).size(); i++) {
                           log_likelihood[t] += joint_dist[i*3+0] + joint_dist[i*3+1] + joint_dist[i*3+2];
                           sample[t*(*this->data).size() + i] = this->start_pos[i];
                           orientation[t*(*this->data).size() + i] = this->pos_strain[i];
                   }
                   if( (t % 20 == 0 && t < burn_in && t > 0) || t+1 == burn_in)
                   this->make_shift(pseudocounts, &tmp_para, generator, t, burn_in, &log_likelihood[t]);
                   if( max < log_likelihood[t] ) {
                   this->print_motif(this->start_pos, this->pos_strain, "POS-1.txt", pseudocounts);
                   max = log_likelihood[t];
                   }
           }
           cout << endl;
           double bayes_factor = this->calc_bayes_factor(sample, orientation, pseudocounts, iteration, burn_in);
           this->log_likelihood = log_likelihood;
           return bayes_factor;
      }

      void print_motif(vector<unsigned short> &tmp_start_pos, vector<char> &strain, string name, double pseudocounts) {
           vector<double> tmp_foreground(this->motif_length * this->alphabet_size, pseudocounts);
           for(int i = 0; i < tmp_start_pos.size(); i++) {
                   if(strain[i] == 'f')
                   for(int w = 0; w < this->motif_length; w++)
                   tmp_foreground[w * this->alphabet_size + this->hash((*this->data)[i][tmp_start_pos[i] + w])]++;
                   if(strain[i] == 'r')
                   for(int w = 0; w < this->motif_length; w++)
                   tmp_foreground[(this->motif_length-1-w) * this->alphabet_size + this->alphabet_size-1-this->hash((*this->data)[i][tmp_start_pos[i] + w])]++;   
           }
           for(int w = 0; w < this->motif_length; w++) {
                   double tmp_sum = 0;
                   for(int a = 0; a < this->alphabet_size; a++)
                   tmp_sum += tmp_foreground[w * this->alphabet_size + a];
                   for(int a = 0; a < this->alphabet_size; a++)
                   tmp_foreground[w * this->alphabet_size + a] = log(tmp_foreground[w * this->alphabet_size + a] / tmp_sum);
           }
           fstream result;
           result.open(name.c_str(), ios::out);
           for(int i = 0; i < this->alphabet_size; i++) {
                   for(int j = 0; j < this->motif_length; j++)
                   result << exp(tmp_foreground[j * this->alphabet_size + i]) << "\t";
                   result << endl;
           }
           result.close();
      }      
};

vector<string> read_string(string path_to_file) {
              ifstream input_txt;
              input_txt.open(path_to_file.c_str(), ios::in);

              vector<string> input;
              string value = "";     
              string input_char = "";

              while(!input_txt.eof()) {
                    input_char = input_txt.get();
                    if(input_char == "\n") {
                           input.push_back(value);
                           value = "";
                    }
                    else
                    { value += input_char; }
              }
              input_txt.close();
              return input;
};

double log_of_sums(vector<double> &summands, double find_max) {
              double res = 0;
              for(unsigned short i = 0; i < summands.size(); i++)
              res += exp(summands[i] - find_max);
              res = log(res) + find_max;
              return res;
};

int main(int argc, char *argv[]) {

   string my_alphabet[] = {"a","c","g","t"};               
   vector<string> alphabet (my_alphabet,my_alphabet+sizeof(my_alphabet)/sizeof(string));
   vector<string>data = read_string(argv[1]);
   int max_data_length = 0;
   for(int i = 0; i < data.size(); i++)
   if(data[i].size() > max_data_length)
   max_data_length = data[i].size();

   vector<vector<double> > lh;
   default_random_engine   generator;
   double                  max_value = -1 * numeric_limits<double>::max();;
   vector<double>          bayes_factors;

   for(int i = 0; i < atoi(argv[3]); i++) {
           cout << "(" << i << ")" << endl;
           OOPS *tmp = new OOPS(&data, atoi(argv[2]), alphabet, atoi(argv[7]), generator);
           bayes_factors.push_back(tmp->gipps(atoi(argv[4]), i, atoi(argv[5]), atof(argv[6]), generator, max_value));
           lh.push_back(tmp->log_likelihood);
   }

   cout << "Bayes-Factors:" << endl;
   double find_max = -1 * numeric_limits<double>::max();
   for(int i = 0; i < bayes_factors.size(); i++) {
           cout << bayes_factors[i] << endl;
           if(find_max < bayes_factors[i])
           find_max = bayes_factors[i];
   }
   cout << " => " << log_of_sums(bayes_factors, find_max) - log(bayes_factors.size()) << endl;

   /////////////////////////////////////////////////////////////////////////////
   // Ausgabe der Likelihoods
   fstream result2;
   string name2 = "Likelihood-gipps.txt";
   result2.open(name2.c_str(), ios::out);
   for(int i = 0; i < lh.size(); i++) {
           for(int j = 0; j < lh[i].size(); j++)
           result2 << lh[i][j] << " ";
           result2 << endl;
   }
   result2.close();

   return 0;
}

SeppJ

Das ist übrigens reines Standard-C++, hat also nichts mit Linux-Systemprogrammierung zu tun.

Das ist wirklich ziemlich viel Code, aber es reicht schon, eine statische Codeanalyse mit dem Compiler(!) durchzuführen, um dutzende kleine und mehrere schwere Fehler zu finden:

test.cc: In member function ‘void OOPS::calc_background()’:
test.cc:111:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                 for(int i = 0; i < (*this->data).size(); i++) {
                                                       ^
test.cc:112:67: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                         for(int j = 1; j <= (*this->data)[i].size(); j++) {
                                                                   ^
test.cc:131:68: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                                     for(int t = 0; t < prefix.size(); t++) {
                                                                    ^
test.cc:147:52: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                 for(int i = 0; i < background.size(); i++) {
                                                    ^
test.cc: In member function ‘void OOPS::sequence_prob(short unsigned int, int, std::vector<double>*, std::vector<double>*, double*, double*)’:
test.cc:156:26: warning: unused variable ‘test’ [-Wunused-variable]
                   double test = 0;
                          ^
test.cc: In member function ‘void OOPS::sample_from_gammas(std::vector<double>*, std::vector<double>*, std::vector<double>*, std::vector<double>*)’:
test.cc:248:46: warning: unused variable ‘res2’ [-Wunused-variable]
              double sum, r, m, tmp_max, res, res2;
                                              ^
test.cc: In member function ‘void OOPS::calc_start_pos()’:
test.cc:347:50: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
            for(int i = 0; i < (*this->data).size(); i++) {
                                                  ^
test.cc:349:75: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                    for(int u_i = 0; u_i < (*this->data)[i].size() - this->motif_length; u_i++) {  
                                                                           ^
test.cc:339:19: warning: unused variable ‘log_likelihood’ [-Wunused-variable]
            double log_likelihood = 0;
                   ^
test.cc: In member function ‘double OOPS::calc_bayes_factor(std::vector<short unsigned int>&, std::vector<char>&, double, int, int)’:
test.cc:375:52: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
            for(int i = 0; i < this->start_pos.size(); i++) {
                                                    ^
test.cc:419:60: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                    for(int i = 0; i < this->start_pos.size(); i++) {
                                                            ^
test.cc: In member function ‘void OOPS::make_shift(double, std::vector<double>*, std::default_random_engine&, int, int, double*)’:
test.cc:475:58: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                    for(int i = 0; i < (*this->data).size(); i++) {
                                                          ^
test.cc:496:58: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                    for(int i = 0; i < (*this->data).size(); i++) {
                                                          ^
test.cc:515:58: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                    for(int i = 0; i < (*this->data).size(); i++) {
                                                          ^
test.cc: In constructor ‘OOPS::OOPS(std::vector<std::basic_string<char> >*, int, std::vector<std::basic_string<char> >, int, std::default_random_engine&)’:
test.cc:545:54: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                for(int i = 0; i < (*this->data).size(); i++)
                                                      ^
test.cc:546:45: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                if((*this->data)[i].size() > max_data_length)
                                             ^
test.cc: In member function ‘void OOPS::print_motif(std::vector<short unsigned int>&, std::vector<char>&, std::string, double)’:
test.cc:594:50: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
            for(int i = 0; i < tmp_start_pos.size(); i++) {
                                                  ^
test.cc: In function ‘int main(int, char**)’:
test.cc:655:33: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
    for(int i = 0; i < data.size(); i++)
                                 ^
test.cc:656:24: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
    if(data[i].size() > max_data_length)
                        ^
test.cc:673:42: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
    for(int i = 0; i < bayes_factors.size(); i++) {
                                          ^
test.cc:685:31: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
    for(int i = 0; i < lh.size(); i++) {
                               ^
test.cc:686:42: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
            for(int j = 0; j < lh[i].size(); j++)
                                          ^
test.cc: In member function ‘int OOPS::hash(char)’:
test.cc:65:9: warning: control reaches end of non-void function [-Wreturn-type]
         }
         ^

Die ganzen sign-compares sind relativ harmlos, wenn du versprichst, dass du darüber nachdenkst, warum das gefährlich sein kann und dann noch, ob der Fall hier auftreten kann.

Die unused variables deuten schwer darauf hin, dass du beim Programmieren etwas falsch gemacht hast. Sie sind oft ein Zeichen für Logikfehler im Code. Guck dir jeden ganz genau an und korrigier sie!

Das Verlassen einer Funktion ohne Rückgabewert ist natürlich eine schwere Todsünde, falls dieser Fall wirklich eintreten kann. Mach dir nochmal Gedanken, ob das wirklich 100% richtig ist.

Nun noch eine menschliche Codeanalyse durch Überfliegen, bei dem ich bestimmt nur einen Bruchteil der Fehler bemerkt habe:

Pointer auf vector<string>?
Überflüssige, selbstverschuldete Speicherlöcher im Bereich von Zeile 666.
Die typischen Logikfehler bei Leseaktionen, die du wohl von einem schlechten Lehrer übernommen hast...
...vermutlich ebenso wie die dilletantische Umwandlung von Text in Zahlen..
...und das noch viel schlimmere Lesen an sich, das sich durch ein bis zwei Zeilen aus der Standardbibliothek ersetzen ließe, die dann auch garantiert funktionieren und keine Fehler enthalten.
Letzteres trifft auf so ziemlich alles im Code zu. Du machst unnötig und fehleranfällig das nach, was es schon lange viel besser gibt. Das machst du sehr häufig! Noch dazu sind deine Ersatzkonstrukte an vielen Stellen tatsächlich fehleranfällig, das ist keine bloße Vermutung. Beispielsweise deine Maximumsuche. Ojeh! Hättest du doch einfach max_element genommen.
Statische Arrays, wo du eigentlich dynamische Arrays möchtest. Einladung für Fehler bei großen Eingabedatenmengen. Das passt doch genau zu einer gewissen Fehlerbeschreibung:

ComputerCarl schrieb:

Da das Problem ausschließlich bei "größen" Daten auftritt

Na, das wird dann wohl der hier beobachtete Fehler sein. Das heißt natürlich nicht, dass der Rest nicht trotzdem falsch wäre

const-correctness ist wohl ein Fremdwort. Nein, const selbst sscheint ein Fremdwort zu sein, es kommt nirgendwo vor, was mir sehr ungewöhnlich vorkommt für ein Programm dieser Länge. Das macht es dem Compiler schwerer, Logikfehler zu finden.
Man muss nicht immer alles in eine Klasse stopfen, bloß weil man objektorientiert programmiert. Insbesondere wenn das einzige Objekt das man hinterher erzeugt den tollen Namen tmp (ein Name der an sich schon ein Alarmzeichen ist, egal in welchem Zusammenhang) trägt. Das deutet doch sehr auf Fehldesign hin.

Wenn du ein Beispiel für eine Datendatei mit Fehler irgendwo hochladen kannst, könnte man noch mehr sagen. Ich muss noch sagen, dass ich deinen Stil schrecklich finde, sowohl die Variablendeklaration am Scopebegin, als auch das explizite this. Ich finde das sehr schwer zu lesen und würde sicherlich noch mehr Fehler finden können, wenn es nur einfacher zu lesen wäre.

ComputerCarl

Ich hatte gerade noch einen Termin und bin gerade erst zurückgekommen. Vielen Dank für deine Analyse! Ich werde alles versuchen zu verstehen und ändern. Leider hatte ich nur 1-mal ein Fach mit Programmierung und es wurden keine "Programmier-Standarts" besprochen, sondern nur wie wir lauffähigen Code produzieren. Hast du vielleicht eine gute Quelle, die mir hilft meinen "schrecklichen" Stil einer allgemeinen Ästhetik anzupassen.
Daten kann ich wohl nicht hochladen, weil ich sie nicht rausgeben dürfte (denke ich). Ich kann aber ein einfaches Sampling-Programm schreiben, welches solche Daten produziert. Das würde ich dann Morgen machen. Heute werde ich versuchen alles angesprochene umzusetzten.
Vielen Dank nochmal für die ausführliche Antwort.

SeppJ

Es wäre halt hilfreich, wenn man den Fehler reproduzieren könnte. Ob mit echten oder mit generierten Daten ist egal. Stell aber sicher, dass mit den Daten auch wirklich der Fehler auftritt!

Stil: Ersteinmal ist lauffähig != funktionierend, wie du gerade feststellst. Es lohnt sich schon, einigermaßen Programmieren zu lernen. Ganz besonders zu Anfang kannst du leicht so viel Nützliches lernen, dass du später damit enorm viel Zeit sparst, viel mehr als du zum Lernen gebraucht hast. Ich weiß jetzt nicht, wie lange du für dein Programm gebraucht hast, aber vermutlich eine ganze Weile, oder? Ich schätze, jemand mit ein paar Monaten Erfahrung und Kenntnis der Standardbibliothek könnte ein gleichartiges Programm in < 200 Zeilen an einem Nachmittag schreiben. Eines, das auch direkt perfekt funktioniert.
http://xkcd.com/1205/

Hast du schon einmal ein Anfängerbuch komplett gelesen? Also ein richtiges Anfängerbuch, mit > 1000 Seiten, das alle Grundlagen erklärt, keine kleine Einführung? Falls nicht, hol dir ein gutes. Vermeide aber unbedingt die vielen schlechten Bücher auf dem deutschen Markt! Siehe den zweiten und vor allem den vierten Link in meiner Signatur. Danach empfiehlt sich noch ein Blick durch die ersten paar Bücher für Fortgeschrittene wie z.B. die von Meyers oder Sutter. Vergiss nicht, dass du nebenbei die ganze Zeit weiterarbeiten kannst. Das übt! Übung ist wichtig! Und du kannst das Forum lesen. Auch selber versuchen, deine Lösung zu Fragen zu zeigen, selbst wenn sie dann kritisiert werden. Das ist ja gerade das Ziel, dass du lernst, was daran nicht gut ist und daher wird hier auch so viel kritisiert (auch wenn es daher oft unhöflich klingt, ist es nicht unbedingt so gemeint, sondern eher als Hilfe). Nach den fortgeschrittenen Büchern solltest du für deine Zwecke genug wissen.

Da du wohl an einer Uni oder ähnlichem bist, gibt es dort bestimmt auch eine Bibliothek mit diesen Büchern.

ComputerCarl

So ich habe die Fehler behoben und nun läuft das Programm eigentlich gut durch. Nachdem was du aber alles gesagt hast, versuche ich es einfach nochmal von Grund auf neu zu machen. Weniger als 200 Zeilen, dass klingt nach einer Herrausforderung. Mal schauen ich bedanke mich für die Hilfe und Kritik.

SeppJ

ComputerCarl schrieb:

Weniger als 200 Zeilen, dass klingt nach einer Herrausforderung.

Die offensichtlichsten Kürzungen wären:
log_of_sums, log_gamma, read_string, zweites log_of_sums, teilweise init_foreground. Diese können gänzlich oder größtenteils durch Einzeiler aus der Standardbibliothek ersetzt werden. Der Rest ist mir, wie gesagt, etwas zu unleserlich, aber prinzipiell sehe ich da viele weitere Stellen, an denen du Standardalgorithmen ausschreibst, anstatt einfach fix und fertige Funktionen zu benutzen. Erfahrungsgemäß machen Anfänger auch viele Dinge, die nicht nötig sind oder unnötig umständlich. Da kann man noch einmal viel kürzen, müsste ich aber drüber nachdenken, was der Code überhaupt genau machen soll. 200 ist schon herausfordernd (halte ich aber für realistisch), 300-400 sollte einfach sein.

ComputerCarl

Hallo nochmal.
Das Thema hat sich wirklich komplett in eine andere Richtung entwickelt und betrifft wohl nur noch C++Programmierung und hat nichts mehr mit Linux oder ähnlichem zu tun. Vielleicht könnte ein Admin dieses Topic verschieben?

Ich habe gestern nochmal alles programmiert und hab mich da von der Kritik leiten lassen. Leider bin ich wie schon längst aufgefallen ist eher ein Anfänger.
Ich würde gerne deshalb den neuen Code anhängen, um somit noch ein paar Fragen zu stellen.

Fragen an SeppJ:
-> "Überflüssige, selbstverschuldete Speicherlöcher im Bereich von Zeile 666."
Das war ja im Ursprungs-Code. Leider weiss ich nicht was du damit meinst. Könntest du das bitte etwas einfacher für micht formulieren?
-> "Die typischen Logikfehler bei Leseaktionen"
Auch das klingt sehr wichtig und ich weiss nicht was du meinst.
-> "dilletantische Umwandlung von Text in Zahlen"
eine andere Umwandlung wäre wohl mit Hilfe von maps oder so. Was wäre da erfahrungsgemäß eine der schnelleren und besten Umwandlungen.

Allgemein muss ich den Code irgendwie schneller hinbekommen, da die Datenmengen die ich bearbeiten muss ziemlich groß sind. Eine Berechnungszeit von bis 3 Tagen wäre o.k.,
aber ich brauche wohl für meinen Datensatz eine Woche. Jeder Tipp wäre hilfreich und bei Fragen zum Code, der Thematik oder der Daten würde ich gerne Auskunft geben.

#include<fstream>
#include<vector>
#include<limits>
#include<sstream>
#include<random>
#include<map>
#include<math.h>
#include<algorithm>
#include<iostream>

using namespace std;

class ds_ZOOPS {

public :
        vector<string>          data;
        vector<char>            alphabet;
        unsigned                max_data_length;

        double                  absolut_min;
        double                  motif_prob;
        double                  strain_prob;

        unsigned short          motif_length;
        double                  foreground[10000];
        unsigned short          order_bg;
        double                  background[1000000];

        vector<unsigned>        start_pos;
        vector<char>            pos_strain;
        vector<double>          log_likelihood;

private :

        vector<string> read_string(string path_to_file) {
              ifstream input_txt;
              input_txt.open(path_to_file.c_str(), ios::in);

              vector<string> input;
              string value = "";     
              string input_char = "";

              while(!input_txt.eof()) {
                    input_char = input_txt.get();
                    if(input_char == "\n") {
                           input.push_back(value);
                           value = "";
                    }
                    else
                    { value += input_char; }
              }
              input_txt.close();
              return input;
        };

        double log_of_sums(vector<double> &summands) {
               const double find_max = *max_element(summands.begin(), summands.end());
               double res = 0;
               for(unsigned i = 0; i < summands.size(); i++)
               res += exp(summands[i] - find_max);
               res = log(res) + find_max;
               return res;
        }

        int hash(char a) {
            for(unsigned short i = 0; i < this->alphabet.size(); i++) {
                    if(a == this->alphabet[i])
                    return i;
            }
            cout << a << " ...is not in the alphabet...\n";
            return 0;      
        }

        void init_foreground(default_random_engine &generator) {
               gamma_distribution<double> distribution(0.3, 1.0);               
               vector<double> tmp_para(this->alphabet.size(), 0);
               for(unsigned l = 0; l < this->motif_length; l++) {
                        for(unsigned short a = 0; a <this->alphabet.size(); a++)
                        tmp_para[a]  = distribution(generator);
                        for(unsigned short a = 0; a < tmp_para.size(); a++)
                        this->foreground[l * this->alphabet.size() + a] = log(tmp_para[a] / accumulate(tmp_para.begin(), tmp_para.end(), 0.0));
               }
        }

        void calc_background() {
                map<string, double> hash_table;
                map<int, double> hash_prob;

                vector<string> pseudo_count_keys(1, "");
                // Hash-Table with Data-Counts
                for(unsigned i = 0; i < this->data.size(); i++) {
                        for(unsigned j = 1; j <= this->data[i].size(); j++) {
                                if(j <= this->order_bg)
                                hash_table[this->data[i].substr(0, j)]++;
                                else
                                hash_table[this->data[i].substr(j-this->order_bg, this->order_bg)]++;
                        }
                }
                // Hash-Table with normalized Probabilities
                map<string, double>::iterator it;
                for (it=hash_table.begin(); it!=hash_table.end(); ++it) {
                    string prefix = it->first.substr(0, it->first.size() - 1);
                    double sum = 0;
                    for(int a = 0; a < this->alphabet.size(); a++)
                    sum += hash_table[prefix+this->alphabet[a]];
                    for(int a = 0; a < this->alphabet.size(); a++) {
                            int key = 0;
                            if(prefix.size() > 0) {
                                    for(unsigned t = 0; t < prefix.size(); t++) 
                                    key += (this->hash(prefix[prefix.size()-t-1])+1)*(int)pow(10, t);
                                    key *= 10;
                            }
                            key += this->hash(this->alphabet[a])+1;
                            hash_prob[key] = log(hash_table[prefix+this->alphabet[a]] / sum);
                    }
                }
                vector<double> background;
                map<int, double>::iterator it2;
                for(int i = 0; i <= this->order_bg; i++) {
                        for (it2=hash_prob.begin(); it2 !=hash_prob.end(); ++it2)
                        background.push_back(it2->second);
                }
                for(unsigned i = 0; i < background.size(); i++)
                this->background[i] = background[i];
        }

        void sequence_prob(int i, vector<double> *result_for, vector<double> *result_rev, double *result_zer) {               
               const int temp = (int)pow(this->alphabet.size(), this->order_bg-1);
               for(unsigned u_i = 0; u_i < this->data[i].size() - this->motif_length; u_i++) {
                  (*result_for)[u_i] = 0;
                  (*result_rev)[u_i] = 0;
                  int word = 0;
                  // left-flanking-region
                  for(unsigned l = 1; l <= u_i; l++) {
                       if(l <= this->order_bg) {
                             word    *= this->alphabet.size();
                             word    += this->hash(this->data[i][l-1]);
                             (*result_for)[u_i] += this->background[word];
                             (*result_rev)[u_i] += this->background[word];
                       }
                       else {  
                             word    %= temp;
                             word    *= this->alphabet.size();
                             word    += this->hash(this->data[i][l-1]);
                             (*result_for)[u_i] += this->background[word];
                             (*result_rev)[u_i] += this->background[word];
                       }
                  }
                  // motif and reverse motif
                  for(unsigned l = u_i; l < (u_i + this->motif_length); l++) {
                  (*result_for)[u_i] += this->foreground[(l - u_i) * this->alphabet.size() + this->hash(this->data[i][l])];
                  (*result_rev)[u_i] += this->foreground[((u_i + this->motif_length)-1-l) * this->alphabet.size() + this->alphabet.size() - 1 - this->hash(this->data[i][l])];
                  }
                  // right-flanking region
                  word = 0;
                  for(unsigned l = u_i + this->motif_length+1; l <= this->data[i].size(); l++) {                       
                       if(l - (u_i + this->motif_length) <= this->order_bg) {
                            word    *= this->alphabet.size();
                            word    += this->hash(this->data[i][l-1]);                            
                            (*result_for)[u_i] += this->background[word];
                            (*result_rev)[u_i] += this->background[word];
                       }
                       else {
                            word    %= temp;
                            word    *= this->alphabet.size();
                            word    += this->hash(this->data[i][l-1]);
                            (*result_for)[u_i] += this->background[word];
                            (*result_rev)[u_i] += this->background[word];
                       }
                  }
                  // start-position-prior
                  (*result_for)[u_i] += log((double)1./(this->data[i].size()-this->motif_length));
                  (*result_rev)[u_i] += log((double)1./(this->data[i].size()-this->motif_length));
               }
               // ZOOPS!!!
               *result_zer = 0;                
               unsigned word = 0;
               for(unsigned l = 1; l <= this->data[i].size(); l++) {
                    if(l <= this->order_bg) {
                         word    *= this->alphabet.size();
                         word    += this->hash(this->data[i][l-1]);
                         *result_zer  += this->background[word];
                    }
                    else {  
                         word    %= temp;
                         word    *= this->alphabet.size();
                         word    += this->hash(this->data[i][l-1]);
                         *result_zer  += this->background[word];
                    }
               }
        }

        void calculate_all_gammas(vector<double> *gamma_for, vector<double> *gamma_rev, vector<double> *gamma_zer, vector<double> *joint_dist) {                              
             vector<double> log_seq_prob_for((this->max_data_length - this->motif_length), 0);
             vector<double> log_seq_prob_rev((this->max_data_length - this->motif_length), 0);
             double         log_seq_prob_zer = 0;
             for(unsigned i = 0; i < this->data.size(); i++) { 
                          this->sequence_prob(i, &log_seq_prob_for, &log_seq_prob_rev, &log_seq_prob_zer);                     
                          const double divisor_for  = this->log_of_sums(log_seq_prob_for);
                          (*joint_dist)[i*3+1] = divisor_for + this->motif_prob + this->strain_prob;
                          const double divisor_rev  = this->log_of_sums(log_seq_prob_rev);                 
                          (*joint_dist)[i*3+2] = divisor_rev + this->motif_prob + log(1-exp(this->strain_prob));
                          for(unsigned u_i = 0; u_i < (this->data[i].size()-this->motif_length); u_i++) {
                                  (*gamma_for)[i*(this->max_data_length-this->motif_length)+u_i] = exp(log_seq_prob_for[u_i] - divisor_for);
                                  (*gamma_rev)[i*(this->max_data_length-this->motif_length)+u_i] = exp(log_seq_prob_rev[u_i] - divisor_rev);
                          }
                          (*gamma_zer)[i]      = log_seq_prob_zer;
                          (*joint_dist)[i*3+0] = (*gamma_zer)[i] + log(1 - exp(this->motif_prob));
             }             
        }

        void sample_from_gammas (vector<double> *gamma_for, vector<double> *gamma_rev, vector<double> *gamma_zer, vector<double> *joint_dist, default_random_engine &generator) {
             vector<double> tmp((this->max_data_length-this->motif_length), 0);
             for(unsigned i = 0; i < this->data.size(); i++) {
                     vector<double> tmp_vec{(*joint_dist)[i*3+0], (*joint_dist)[i*3+1], (*joint_dist)[i*3+2]};
                     const double m = log((double) rand() / (RAND_MAX));                                      // 0 <= r < 1
                     const double sum = this->log_of_sums(tmp_vec);
                     if(m < (*joint_dist)[i*3+0] - sum ) {                                                    //(1) ZOOPS
                          this->start_pos[i] = data[i].size()-1;
                          this->pos_strain[i]= 'z';
                     }
                     else {
                          if(exp(m) < exp((*joint_dist)[i*3+0]-sum) + exp((*joint_dist)[i*3+1]-sum)) {        //(2) OOPS
                               discrete_distribution<> distribution(&(*gamma_for)[i*(this->max_data_length-this->motif_length)+0], &(*gamma_for)[i*(this->max_data_length-this->motif_length)+(this->data[i].size()-this->motif_length)]);
                               this->start_pos[i] = distribution(generator);
                               this->pos_strain[i]= 'f';
                          }
                          else {                                                                              //(3) rev-OOPS
                               discrete_distribution<> distribution(&(*gamma_rev)[i*(this->max_data_length-this->motif_length)+0], &(*gamma_rev)[i*(this->max_data_length-this->motif_length)+(this->data[i].size()-this->motif_length)]);
                               this->start_pos[i] = distribution(generator);
                               this->pos_strain[i]= 'r';
                          }                        
                     }
             }
        }

        void sample_foreground(double pseudocounts, default_random_engine &generator) {
             // Count all Counts and Pseudocounts
             vector<double> counts_all(this->motif_length*this->alphabet.size(), pseudocounts);
             for(unsigned i = 0; i < this->data.size(); i++)
                for(unsigned j = 0; j < this->motif_length; j++)
                   if(this->start_pos[i] < this->data[i].size() - this->motif_length && this->start_pos[i] >= 0) {
                          if(this->pos_strain[i] == 'f')
                          { counts_all[j*this->alphabet.size()+this->hash(this->data[i][this->start_pos[i] + j])]++; }
                          if(this->pos_strain[i] == 'r')
                          { counts_all[(this->motif_length-1-j)*this->alphabet.size()+this->alphabet.size()-1-this->hash(this->data[i][this->start_pos[i] + j])]++; }
                   }
             // Sample Parameters
             vector<double> tmp_para(this->alphabet.size(), 0);
             for(unsigned l = 0; l < this->motif_length; l++) {
                for(unsigned short a = 0; a < this->alphabet.size(); a++) {
                        gamma_distribution<double> distribution(counts_all[l*this->alphabet.size()+a],1.0);
                        tmp_para[a]  = distribution(generator);
                }
                for(unsigned short a = 0; a < tmp_para.size(); a++)
                this->foreground[l*this->alphabet.size()+a] = log(tmp_para[a] / accumulate(tmp_para.begin(), tmp_para.end(), 0.0));
             }
        }

      void make_shift(double pseudocounts, default_random_engine& generator, unsigned short t, unsigned short burn_in, double *log_likelihood) {

           vector<unsigned>       real_start = this->start_pos;
           vector<char>           real_strain= this->pos_strain;
           double                 real_foreground[10000];
           copy(&this->foreground[0], &this->foreground[10000], real_foreground);

           vector<double> gamma_for (this->data.size()*(this->max_data_length-this->motif_length),0);
           vector<double> gamma_rev (this->data.size()*(this->max_data_length-this->motif_length),0);
           vector<double> gamma_zer (this->data.size(),0);
           vector<double> joint_dist(this->data.size() * 3, 0);

           for(short s = -2; s < 2; s++) {
                   if(s == 0)
                   s++;
                   // shift
                   double max = 0;
                   for(unsigned i = 0; i < this->data.size(); i++) {
                         if(this->pos_strain[i] == 'f') { this->start_pos[i] += s; }
                         if(this->pos_strain[i] == 'r') { this->start_pos[i] -= s; }
                         if(this->start_pos[i] < 0 || this->start_pos[i] >= this->data[i].size() - this->motif_length)
                         { this->pos_strain[i] = 'z'; this->start_pos[i]  = this->data[i].size()-1; }
                   }
                   this->sample_foreground(pseudocounts, generator);
                   this->calculate_all_gammas(&gamma_for, &gamma_rev, &gamma_zer, &joint_dist);
                   for(unsigned i = 0; i < this->data.size(); i++) {
                         vector<double> tmp_vec {joint_dist[i*3+0], joint_dist[i*3+1], joint_dist[i*3+2]};
                         max += this->log_of_sums(tmp_vec);
                         // return shift
                         if(this->pos_strain[i] == 'f') { this->start_pos[i] -= s; }
                         if(this->pos_strain[i] == 'r') { this->start_pos[i] += s; }
                   }
                   if(max > *log_likelihood) {
                         *log_likelihood = max;
                         copy(&this->foreground[0], &this->foreground[10000], real_foreground);
                         real_start = this->start_pos;
                         real_strain= this->pos_strain;
                   }
           }
           //this->start_pos = real_start;
           //this->pos_strain= real_strain;
           copy(&real_foreground[0], &real_foreground[10000], this->foreground);
      }

public :

       ds_ZOOPS(string path_to_file, unsigned short motif_length, vector<char> alphabet, unsigned short bg_order, default_random_engine &generator, unsigned short iteration) {
               this->data = this->read_string(path_to_file);
               this->alphabet    = alphabet;
               this->motif_length= motif_length;
               this->order_bg    = bg_order; 
               this->absolut_min = -1 * numeric_limits<double>::max();
               this->motif_prob  = log(0.9);
               this->strain_prob = log(0.5);

               this->max_data_length = 0;
               for(unsigned i = 0; i < this->data.size(); i++)
               if(this->data[i].size() > max_data_length)
               this->max_data_length = this->data[i].size();

               vector<unsigned> pos(this->data.size(), 0);
               this->start_pos      = pos;
               vector<char>     strain(this->data.size(), 'f');
               this->pos_strain     = strain;
               vector<double>   log_likelihood(iteration, 0);
               this->log_likelihood = log_likelihood;

               this->init_foreground(generator);
               this->calc_background();
       };

       void gibbs(unsigned short iteration, unsigned short id, unsigned short burn_in, double pseudocounts, default_random_engine& generator, double& max) {
           this->init_foreground(generator);
           vector<double>   gamma_for (this->data.size()*(this->max_data_length-this->motif_length),0);
           vector<double>   gamma_rev (this->data.size()*(this->max_data_length-this->motif_length),0);
           vector<double>   gamma_zer (this->data.size(),0);
           vector<double>   joint_dist(this->data.size() * 3, 0);
           vector<unsigned> sample(iteration * this->data.size(), 0);
           vector<char>     orientation(iteration * this->data.size(), 'z');

           for(unsigned short t = 0; t < iteration; t++) {
                   this->calculate_all_gammas(&gamma_for, &gamma_rev, &gamma_zer, &joint_dist);
                   this->sample_from_gammas  (&gamma_for, &gamma_rev, &gamma_zer, &joint_dist, generator);
                   this->sample_foreground   (pseudocounts, generator);                   
                   for(unsigned i = 0; i < this->data.size(); i++) {
                           vector<double> tmp_vec {joint_dist[i*3+0], joint_dist[i*3+1], joint_dist[i*3+2]};
                           this->log_likelihood[t] += this->log_of_sums(tmp_vec);;
                           sample[t*this->data.size() + i]      = this->start_pos[i];
                           orientation[t*this->data.size() + i] = this->pos_strain[i];
                   }
                   if( (t % 20 == 0 && t < burn_in && t > 0) || t+1 == burn_in)
                   this->make_shift(pseudocounts, generator, t, burn_in, &log_likelihood[t]);
                   if( max < log_likelihood[t] ) {
                   this->print_motif(this->start_pos, this->pos_strain, "POS-1.txt", pseudocounts);
                   max = log_likelihood[t];
                   }
           }
      }

      void print_motif(vector<unsigned> &tmp_start_pos, vector<char> &strain, string name, double pseudocounts) {
           vector<double> tmp_foreground(this->motif_length * this->alphabet.size(), pseudocounts);
           for(unsigned i = 0; i < tmp_start_pos.size(); i++) {
                   if(strain[i] == 'f')
                   for(unsigned short w = 0; w < this->motif_length; w++)
                   tmp_foreground[w * this->alphabet.size() + this->hash(this->data[i][tmp_start_pos[i] + w])]++;
                   if(strain[i] == 'r')
                   for(unsigned short w = 0; w < this->motif_length; w++)
                   tmp_foreground[(this->motif_length-1-w) * this->alphabet.size() + this->alphabet.size()-1-this->hash(this->data[i][tmp_start_pos[i] + w])]++;   
           }
           for(unsigned short w = 0; w < this->motif_length; w++) {
                   double tmp_sum = 0;
                   for(unsigned short a = 0; a < this->alphabet.size(); a++)
                   tmp_sum += tmp_foreground[w * this->alphabet.size() + a];
                   for(unsigned short a = 0; a < this->alphabet.size(); a++)
                   tmp_foreground[w * this->alphabet.size() + a] = log(tmp_foreground[w * this->alphabet.size() + a] / tmp_sum);
           }
           fstream result;
           result.open(name.c_str(), ios::out);
           for(unsigned short a = 0; a < this->alphabet.size(); a++) {
                   for(unsigned short w = 0; w < this->motif_length; w++)
                   result << exp(tmp_foreground[w * this->alphabet.size() + a]) << "\t";
                   result << endl;
           }
           result.close();
      }      
};

int main(int argc, char *argv[]) {

   char my_alphabet[] = {'a','c','g','t'};               
   vector<char> alphabet (my_alphabet,my_alphabet+sizeof(my_alphabet)/sizeof(char));

   vector<vector<double> > lh;
   default_random_engine   generator;
   double                  max_value = -1 * numeric_limits<double>::max();;

   for(unsigned short i = 0; i < atoi(argv[3]); i++) {
           ds_ZOOPS *tmp = new ds_ZOOPS(argv[1], atoi(argv[2]), alphabet, atoi(argv[7]), generator, atoi(argv[4]));
           tmp->gibbs(atoi(argv[4]), i, atoi(argv[5]), atof(argv[6]), generator, max_value);
           lh.push_back(tmp->log_likelihood);
   }

   /////////////////////////////////////////////////////////////////////////////
   // Ausgabe der Likelihoods und der Bayes-Factoren
   fstream result2;
   string name2 = "Likelihood-gipps.txt";
   result2.open(name2.c_str(), ios::out);
   for(unsigned i = 0; i < lh.size(); i++) {
           for(int j = 0; j < lh[i].size(); j++)
           result2 << lh[i][j] << " ";
           result2 << endl;
   }
   result2.close();

   return 0;
}

SeppJ

ComputerCarl schrieb:

Fragen an SeppJ:
-> "Überflüssige, selbstverschuldete Speicherlöcher im Bereich von Zeile 666."
Das war ja im Ursprungs-Code. Leider weiss ich nicht was du damit meinst. Könntest du das bitte etwas einfacher für micht formulieren?

Benutz kein new. Benutz keine Pointer.

^(Irgendwann, in ferner Zukunft, kommt vielleicht der Moment, an dem du Pointer/new brauchst. Bis dahin weißt du hoffentlich, wie man damit umgeht. Hier jedenfalls ist es einfach nur falsch.)^

-> "Die typischen Logikfehler bei Leseaktionen"
Auch das klingt sehr wichtig und ich weiss nicht was du meinst.

Was ist an folgendem Ablauf falsch?
-Prüfen ob Fehler aufgetreten ist
-Fehleranfällige Aktion durchführen
-Ergebnis verarbeiten
-Zurück zum Anfang

So machst du es aber. In C-artigen Sprachen sieht eine Leseschleife so aus:

Schleife solange (Leseaktion erfolgreich)
{
  verarbeite Ergebnis;
}

Deine Schleife ist ein ganz typisches Alarmsignal für einen schlechten Lehrer/Buchautor. Das ist Grundlagenwissen, wird aber trotzdem oft falsch gezeigt. Von diesem Lehrer/Autor solltest du dich sofort(!) trennen.

-> "dilletantische Umwandlung von Text in Zahlen"
eine andere Umwandlung wäre wohl mit Hilfe von maps oder so. Was wäre da erfahrungsgemäß eine der schnelleren und besten Umwandlungen.

Du brauchst keine Geschwindigkeit beim Lesen/Umwandeln. Was du hingegen bräuchtest, wären Fehlerprüfungen.

Allgemein muss ich den Code irgendwie schneller hinbekommen, da die Datenmengen die ich bearbeiten muss ziemlich groß sind. Eine Berechnungszeit von bis 3 Tagen wäre o.k.,
aber ich brauche wohl für meinen Datensatz eine Woche.

Vermutlich kann man den Algorithmus an sich verbessern. Dafür müsste man aber wissen, was du warum machst. Dies wird vermutlich die Möglichkeiten des Forums übersteigen.

Allgemein fällt mir auf, dass du Dinge sehr oft mehrmals machst, die du nur einmal machen bräuchtest und danach das Ergebnis benutzen könntest.

Jeder Tipp wäre hilfreich und bei Fragen zum Code, der Thematik oder der Daten würde ich gerne Auskunft geben.

Guck dir mal an, was ein Profiler ist, damit du nicht länger so komischen Mythen anhängst, was dein Programm schneller macht. Ich fürchte allerdings, du brauchst mehr Erfahrung, um das Ergebnis zu verstehen. Die kann ich dir nicht geben.