Single-cell RNA-seq data have become a key resource for revealing cellular activity. However, profiling techniques are still maturing and the measurement destroys the profiled cells, which introduces several kinds of noise and missing information, e.g. batch effects and the absence of cell-to-cell correspondence between experimental groups. Many single-cell tasks are therefore better modeled as generative rather than discriminative tasks: no exact cell-wise ground truth is available, and only the distribution of cellular profiles under a given condition can be measured. Because the associations between gene expression levels are complex and highly nonlinear, we developed scVDM, a latent diffusion model with a transformer-based conditional denoiser, to address three generative tasks on single-cell data: conditional data generation, batch effect correction, and drug perturbation prediction. The high-dimensional transcriptomic data are first projected into a latent space by a conditional VAE; the denoiser then uses self-attention to model the relationships between latent dimensions and predict the diffusion noise. Evaluated on five real-world datasets, scVDM shows strong performance on all three generative tasks.
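As a rough illustration of the latent diffusion idea described above, the following sketch shows a standard DDPM-style forward noising step and noise-prediction loss applied to latent codes. It is a minimal sketch under stated assumptions, not the scVDM implementation: the linear beta schedule, the function names (`make_alpha_bar`, `add_noise`, `noise_prediction_loss`), the latent size, and the placeholder denoiser are all hypothetical, and random vectors stand in for the conditional VAE's latent codes.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Sample z_t ~ q(z_t | z_0); return z_t and the Gaussian noise that was added."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[t][:, None]  # one coefficient per cell, broadcast over latent dims
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return z_t, eps

def noise_prediction_loss(denoiser, z0, cond, alpha_bar, rng):
    """Mean squared error between the true noise and the denoiser's prediction."""
    t = rng.integers(0, len(alpha_bar), size=z0.shape[0])  # random step per cell
    z_t, eps = add_noise(z0, t, alpha_bar, rng)
    eps_hat = denoiser(z_t, t, cond)
    return float(np.mean((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((8, 16))  # stand-in for cVAE latent codes of 8 cells
cond = np.zeros(8, dtype=int)      # stand-in condition labels (e.g. batch or drug)

# Placeholder denoiser that always predicts zero noise; a transformer-based
# conditional denoiser would take its place in a real model.
zero_denoiser = lambda z_t, t, cond: np.zeros_like(z_t)
loss = noise_prediction_loss(zero_denoiser, z0, cond, alpha_bar, rng)
```

In this sketch the condition is simply passed through to the denoiser; how the condition is encoded and injected into the attention layers is a design choice the abstract does not specify.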