A brain-computer interface (BCI) offers a direct communication and control channel between the human brain and external devices, presenting new pathways for individuals with physical disabilities to operate robotic arms for complex tasks. However, achieving multiobject grasping with low signal-to-noise ratio (SNR) consumer-grade EEG signals remains a significant challenge due to the lack of robust decoding algorithms and precise visual tracking methods. This article proposes ArmBCIsys, an integrated robotic arm system that combines a novel dual-branch frequency-enhanced network (DBFENet), which robustly decodes EEG signals under noisy conditions, with a high-precision vision-guided grasping module. DBFENet employs a scaling temporal convolution block (STCB) to extract multiscale spatiotemporal features in the time domain, while its DropScale projected Transformer (DSPT) applies the discrete cosine transform (DCT) to capture key frequency-domain features, significantly improving decoding robustness. We fine-tune the masked-attention mask Transformer (Mask2Former) model on the Jacquard dataset and incorporate a multiframe centroid-intersection over union (IoU) tracking algorithm to build the visual grasp segmenter (VisGraspSeg), enabling reliable segmentation and dynamic tracking of diverse daily objects. Experimental validation on a self-built code-modulated visual evoked potential (c-VEP) dataset (1344 samples) and two public c-VEP datasets demonstrates that DBFENet achieves state-of-the-art recognition performance; by integrating DBFENet with the proposed vision-guided module, the complete system ensures stable multiobject selection and automatic object grasping in dynamic environments, with promising applications in healthcare robotics, assistive technology, and industrial automation. The self-built dataset is publicly available at https://github.com/wtu1020/ArmBCIsys-Self-built-cVEP-Dataset.