Co-salient object detection (CoSOD) is the task of identifying and emphasizing the common salient objects in a collection of images. Current CoSOD frameworks often extract features and model inter-image relations separately. Although these methods achieve promising performance in many scenes, separating feature extraction from relation modeling falls short of producing discriminative features for co-salient objects, resulting in suboptimal performance, especially in complex and cluttered real-world scenes. In this article, we introduce a novel CoSOD framework that unifies feature extraction and inter-image relation modeling. We design an early token interaction module (ETIM) that bridges the information flow between branches to realize feature extraction and inter-image information interaction simultaneously. To further enhance the network's ability to distinguish co-salient objects from other irrelevant foreground objects, we introduce a pixel-to-group contrastive (PGC) learning method, which eliminates the need for additional interaction modules while preserving the discriminative power of features for co-salient objects. Our proposed framework consists only of a backbone embedded with ETIM, a decoder without interaction modules, and a projection head used only during the training phase. Extensive experiments on three challenging benchmarks, that is, CoCA, CoSOD3k, and Cosal2015, demonstrate that our method outperforms current leading-edge models and establishes a new state of the art. The source code is available at https://github.com/zhiwang98/LDRNet.
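To make the pixel-to-group contrastive idea concrete, the sketch below shows one plausible InfoNCE-style formulation: pixel embeddings from the whole image group are pulled toward a group prototype (the mean of co-salient pixel embeddings) and pushed away for background pixels. This is a minimal illustration, not the paper's actual loss; the function name, the prototype construction, the `temperature` value, and the use of a single shared prototype are all assumptions made for this example.

```python
import numpy as np

def pixel_to_group_contrastive_loss(pixel_emb, labels, temperature=0.1):
    """Illustrative pixel-to-group (PGC) contrastive loss (assumed form).

    pixel_emb: (N, D) pixel embeddings gathered across the image group.
    labels:    (N,) 1 for co-salient pixels, 0 for background/other pixels.
    """
    # L2-normalize embeddings so dot products are cosine similarities.
    emb = pixel_emb / np.linalg.norm(pixel_emb, axis=1, keepdims=True)
    fg = emb[labels == 1]
    bg = emb[labels == 0]
    # Group prototype: mean of co-salient pixel embeddings, re-normalized.
    proto = fg.mean(axis=0)
    proto /= np.linalg.norm(proto)
    # Temperature-scaled similarities to the group prototype.
    pos = fg @ proto / temperature  # (N_fg,) positives
    neg = bg @ proto / temperature  # (N_bg,) shared negatives
    # InfoNCE: each co-salient pixel competes against all background pixels.
    denom = np.log(np.exp(pos) + np.exp(neg).sum())
    return float(-(pos - denom).mean())
```

As a sanity check, well-separated foreground/background embeddings should yield a near-zero loss, while randomly shuffled labels should yield a clearly larger one, since the prototype then mixes both clusters.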