CUDA program

For a course, I have to write a CUDA program that computes the square roots of 1 to N in parallel. It is supposed to use a 2D grid and 1D blocks.

I have finished my attempt, but nothing happens at all.

The output of my program is always:
1
...
N

Here is the code:

#include <stdio.h>
#include <stdlib.h>

__global__ void gpuSquareRoot(float* a)
{
  int id = threadIdx.x + blockIdx.x * blockDim.x + gridDim.x * blockDim.x * blockIdx.y;

  a[id] = sqrt( (float) id);
}

int main()
{
  const int N = 50;

  // getting memory on the host
  float *h_array = (float*) malloc(N * sizeof(float));

  // now for the gpu
  float *d_array;
  cudaMalloc( (void**) & d_array, N * sizeof(float));

  // init the host-array i.e. filling with values
  for(int i = 0; i< N; i++)
    h_array[i] = i+1;

  // copy values to device
  cudaMemcpy( d_array, h_array, N * sizeof(float), cudaMemcpyHostToDevice);

  //////////////////////////////////////////////////////
  // setting up the grid- and blocksizes
  //////////////////////////////////////////////////////
  dim3 grid, block;

  // 2D grid  =   512*512 blocks
  grid.x = 512;
  grid.y = 512;
  grid.z = 0;

  // 1d block =  512 threads
  block.x = 512;
  block.y = 0;
  block.z = 0;
  //////////////////////////////////////////////////////

  // start the kernel
  gpuSquareRoot<<<grid, block>>>(d_array);

  // copy values back to host
  cudaMemcpy( h_array, d_array, N * sizeof(float), cudaMemcpyDeviceToHost);

  // print some values
  for(int i= 0; i < N; i++)
  {
    printf("%f\n", h_array[i]);
  }

  // free the memory
  cudaFree(d_array);
  free(h_array);
  return 0;
}

What am I doing wrong?

I have now set the unused dimensions to 1 instead of 0.

Unfortunately, my id calculation still seems to be faulty.
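
One possible explanation, offered as an assumption rather than a confirmed cause: the id formula itself is a valid row-major flattening for a 2D grid of 1D blocks, but with 512 x 512 blocks of 512 threads nearly every thread gets an id far beyond N = 50 and writes outside the array. The kernel also takes the root of the index (0 to N-1) rather than of the stored values 1 to N. The sketch below guards the write and sizes the grid from n. The names flatThreadId, gpuSquareRootGuarded and computeRoots, the block size of 32 and the grid width of 2 are hypothetical choices, and n > 0 is assumed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Flattened global thread id for a 2D grid of 1D blocks
// (same formula as in the post, only wrapped in a hypothetical helper).
__host__ __device__ inline int flatThreadId(int tx, int bx, int by, int bdx, int gdx)
{
  return tx + bx * bdx + gdx * bdx * by;
}

__global__ void gpuSquareRootGuarded(float* a, int n)
{
  int id = flatThreadId(threadIdx.x, blockIdx.x, blockIdx.y, blockDim.x, gridDim.x);

  if (id < n)               // surplus threads must not touch memory
    a[id] = sqrtf(a[id]);   // root of the stored value, not of the index
}

// Hypothetical host wrapper: replaces each of the n values in h_array
// by its square root. Assumes n > 0; error checking is omitted for brevity.
void computeRoots(float* h_array, int n)
{
  float* d_array;
  cudaMalloc((void**) &d_array, n * sizeof(float));
  cudaMemcpy(d_array, h_array, n * sizeof(float), cudaMemcpyHostToDevice);

  dim3 block(32);                                        // 1D block; y and z default to 1
  const int gridWidth = 2;                               // assumed width of the 2D grid
  int blocksNeeded = (n + (int) block.x - 1) / (int) block.x;
  dim3 grid(gridWidth, (blocksNeeded + gridWidth - 1) / gridWidth);  // z defaults to 1

  gpuSquareRootGuarded<<<grid, block>>>(d_array, n);

  cudaMemcpy(h_array, d_array, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_array);
}
```

Calling computeRoots(h_array, N) in place of the launch section of main would keep the 2D-grid, 1D-block layout the assignment asks for while launching only as many blocks as N requires.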

Now I get an unspecified launch error.
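
Errors like this only become visible when the return codes of the runtime calls are inspected, which the code above never does. A minimal sketch of such a check follows; cudaOk, fillWithIndex and launchAndCheck are hypothetical names, while cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString are CUDA runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the post): prints a CUDA error
// and returns true only if the call succeeded.
inline bool cudaOk(cudaError_t err, const char* what)
{
  if (err != cudaSuccess)
  {
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
  }
  return true;
}

// Stand-in kernel for the demonstration: writes each index into the array.
__global__ void fillWithIndex(float* a, int n)
{
  int id = threadIdx.x + blockIdx.x * blockDim.x;
  if (id < n)
    a[id] = (float) id;
}

// Launches the kernel and reports launch-time and run-time errors separately.
bool launchAndCheck(float* d_array, int n, dim3 grid, dim3 block)
{
  fillWithIndex<<<grid, block>>>(d_array, n);

  // A rejected launch configuration (for example a dimension of 0) shows up here.
  if (!cudaOk(cudaGetLastError(), "kernel launch"))
    return false;

  // Faults raised while the kernel runs (for example out-of-bounds writes)
  // may only surface at the next synchronizing call.
  return cudaOk(cudaDeviceSynchronize(), "kernel execution");
}
```

With a check like this, a launch whose unused dimensions are 0 is reported as a launch error instead of silently leaving the array unchanged.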

Please close, I have now found the last errors as well!