Falls among the elderly are a major public health concern because of the serious injuries, long-term disabilities, and high healthcare costs they cause. Accurate and timely fall detection is therefore essential for improving safety and supporting independent living. Current fall detection systems rely mainly on single-modality sensors, such as accelerometers or vision-based systems, and often suffer from limited generalization, susceptibility to noise, and reduced reliability in real-world conditions. This paper addresses these limitations through multimodal fusion of wearable and environmental signals, aiming to deliver more robust fall detection than unimodal systems. We propose a deep multimodal framework that integrates signals from inertial measurement units (IMUs), EEG, and infrared sensors in the UP-Fall dataset. The methodology comprises synchronized preprocessing, window-based segmentation, feature- and decision-level fusion, and the design of advanced deep models: CNNs, LSTMs, a hybrid CNN-LSTM, and Transformers with attention mechanisms. Experimental results under subject-wise cross-validation show that multimodal fusion consistently outperforms unimodal baselines. The Transformer-based model achieves the strongest balance between sensitivity and specificity, with improvements in F1-score and ROC-AUC over competing architectures. These findings confirm that attention-driven multimodal integration enables more reliable fall detection, even under noisy conditions and for unseen subjects. The proposed framework offers a scalable and generalizable solution for fall detection, with implications for broader human activity recognition, smart healthcare monitoring, and safer ambient-assisted living environments.
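To make the feature-level fusion idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: each modality's windowed signal is projected into a shared embedding space, concatenated per time step, and passed through a Transformer encoder before classification. The modality dimensions, window length, sampling rate, and layer sizes are illustrative assumptions rather than values taken from the paper or the UP-Fall dataset.

```python
# Illustrative sketch of feature-level multimodal fusion with a Transformer encoder.
# All dimensions and hyperparameters below are assumptions for demonstration only.
import torch
import torch.nn as nn

class MultimodalTransformer(nn.Module):
    def __init__(self, imu_dim=9, eeg_dim=1, ir_dim=4, d_model=64,
                 n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # Project each modality's per-timestep features into a shared space,
        # then concatenate along the feature axis (feature-level fusion).
        self.imu_proj = nn.Linear(imu_dim, d_model)
        self.eeg_proj = nn.Linear(eeg_dim, d_model)
        self.ir_proj = nn.Linear(ir_dim, d_model)
        self.fuse = nn.Linear(3 * d_model, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, imu, eeg, ir):
        # Inputs: (batch, window_len, modality_dim), already synchronized and windowed.
        fused = torch.cat(
            [self.imu_proj(imu), self.eeg_proj(eeg), self.ir_proj(ir)], dim=-1)
        x = self.encoder(self.fuse(fused))   # self-attention over the window
        return self.head(x.mean(dim=1))      # pool over time, classify fall / no-fall

# Example: a batch of 8 windows of 100 samples each (hypothetical window length).
model = MultimodalTransformer()
logits = model(torch.randn(8, 100, 9), torch.randn(8, 100, 1), torch.randn(8, 100, 4))
print(logits.shape)  # torch.Size([8, 2])
```

Decision-level fusion would instead train one such encoder per modality and combine their output probabilities; the sketch above shows only the feature-level variant.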