File Upload - Import Your Research Collections

Quickly import large collections of academic papers using drag-and-drop file upload. Upload .txt or .csv files containing paper IDs from your research collections, literature reviews, or bibliographic databases.

🚀 Quick Start

Step 1: Prepare Your File

Create a simple text file with one paper ID per line:

649def34f8be52c8b66281af98ae884c09aef38f9
204e3073870fae3d05bcbc2f6a8e263d9b72e776
2b8a9c9c9d8f7e6d5c4b3a29f8e7d6c5b4a39f8e

Create a CSV file with paper IDs in the first column:

paper_id,title,source
649def34f8be52c8b66281af98ae884c09aef38f9,"Attention Is All You Need","literature_review"
204e3073870fae3d05bcbc2f6a8e263d9b72e776,"BERT: Pre-training","user_collection"

Step 2: Upload in Interface

  1. Navigate to Data Import page in Streamlit
  2. Select "Paper IDs" import method
  3. Click "📁 File Upload" tab
  4. Drag and drop your file or click "Choose a file"
  5. Preview the paper IDs (first 10 shown)
  6. Configure import settings
  7. Click "▶️ Start Import"

Step 3: Monitor Progress

  • Watch real-time progress bars for import status
  • Monitor performance metrics (papers/second, success rate)
  • Review error logs for any issues
  • Check completion statistics when finished

📁 Supported File Formats

Text Files (.txt)

  • One paper ID per line - Simple format for paper ID lists
  • Comments supported - Lines starting with # are ignored
  • Empty lines ignored - Flexible formatting allowed
  • UTF-8 encoding - International character support

Example with comments:

# Machine Learning Survey Papers - Updated 2024
649def34f8be52c8b66281af98ae884c09aef38f9
204e3073870fae3d05bcbc2f6a8e263d9b72e776

# Additional papers from recent conference
2b8a9c9c9d8f7e6d5c4b3a29f8e7d6c5b4a39f8e
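The comment and blank-line rules above are simple to implement. The following is a minimal sketch of such a parser (the platform's own reader may differ in details; the function name is illustrative):

```python
def parse_paper_id_file(path):
    """Read paper IDs from a .txt file, one per line.

    Lines starting with '#' (comments) and empty lines are ignored,
    matching the rules described above.
    """
    ids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip comments and blank lines
            ids.append(line)
    return ids
```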

CSV Files (.csv)

  • First column contains paper IDs - Additional columns ignored
  • Header row supported - Column names can be included
  • Standard CSV format - Comma-separated values
  • Metadata preservation - Keep additional information in extra columns

Example with metadata:

paper_id,title,journal,year,notes
649def34f8be52c8b66281af98ae884c09aef38f9,"Attention Is All You Need","NIPS",2017,"Transformer architecture"
204e3073870fae3d05bcbc2f6a8e263d9b72e776,"BERT","NAACL",2019,"Bidirectional encoder"
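Because only the first column is read as paper IDs, extracting them yourself (for example, to pre-validate a file before upload) takes just a few lines with the standard library. A sketch, assuming a header row as in the examples above:

```python
import csv

def read_paper_ids_csv(path):
    """Read paper IDs from the first column of a CSV file.

    Skips the header row; additional metadata columns are ignored,
    as the upload interface does.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return [row[0].strip() for row in reader if row and row[0].strip()]
```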

📖 Step-by-Step Guide

Creating Compatible Files

From Zotero:

  1. Select papers in your Zotero library
  2. Right-click → Export Items
  3. Choose format: "CSV" or create a custom format
  4. Extract paper IDs from URLs or DOIs
  5. Save as .txt with one ID per line

From another reference manager:

  1. Go to File → Export
  2. Choose "Plain Text List" format
  3. Extract Semantic Scholar IDs from paper metadata
  4. Create a .txt file with the extracted IDs
  5. Validate the format before upload

From Google Scholar searches:

  1. Perform your search in Google Scholar
  2. Copy paper URLs from the search results
  3. Extract paper IDs from Semantic Scholar links
  4. Create a CSV with IDs and titles
  5. Upload to the platform using the file interface

From a spreadsheet:

  1. Create a column with paper IDs
  2. Add metadata columns (optional: titles, sources)
  3. Save in CSV format
  4. Ensure paper IDs are in the first column
  5. Test with a small sample first
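The "extract paper IDs from Semantic Scholar links" step can be automated. A sketch, assuming IDs appear as the standard 40-character hexadecimal segment at the end of Semantic Scholar paper URLs (the regex and function name are illustrative):

```python
import re

# Semantic Scholar paper URLs typically end in a 40-character hex ID, e.g.
# https://www.semanticscholar.org/paper/Some-Title-Slug/<40-hex-id>
S2_ID_RE = re.compile(r"([0-9a-f]{40})")

def extract_ids(urls):
    """Pull Semantic Scholar paper IDs out of a list of URLs."""
    ids = []
    for url in urls:
        match = S2_ID_RE.search(url)
        if match:
            ids.append(match.group(1))
    return ids
```

Write the returned IDs to a .txt file, one per line, and the file is ready for upload.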

File Upload Process

Interface Navigation

  1. Launch Streamlit app: streamlit run app.py
  2. Open sidebar menu (hamburger icon if collapsed)
  3. Navigate to "Data Management" → "Data Import"
  4. Select import method: Choose "Paper IDs"

Upload Configuration

  • Max Papers: Limit total papers imported (1-10,000)
  • Batch Size: Papers processed per batch (recommended: 25-50)
  • API Delay: Time between requests (recommended: 1-2 seconds)
  • Include Citations: ✅ Import citation relationships (recommended)
  • Include Authors: ✅ Import author information
  • Include Venues: ✅ Import journal/conference data
  • Include References: ✅ Import reference relationships
  • Min Citations: Filter papers with fewer citations
  • Year Range: Publication year filtering
  • Quality Thresholds: Remove incomplete records

Upload Steps

  1. Switch to "📁 File Upload" tab
  2. Click "Choose a file" or drag-and-drop
  3. Wait for file validation (automatic)
  4. Review preview of first 10 paper IDs
  5. Adjust settings if needed
  6. Click "▶️ Start Import"

Progress Monitoring

  • Status indicator: 🟡 Pending → 🔵 In Progress → 🟢 Complete
  • Progress bars: Overall progress and current batch
  • Statistics: Papers processed, citations found, errors encountered
  • Performance metrics: Import speed, success rate, time remaining

📊 Sample Files

The platform provides sample files for testing:

Download Sample Files

  1. Go to Data Import page
  2. Select "Paper IDs" method
  3. Click "📁 File Upload" tab
  4. Expand "📁 Download Sample Files" section
  5. Download either format:
     • sample_paper_ids.txt - Text format with 10 ML papers
     • sample_paper_ids.csv - CSV format with metadata

Sample File Contents

sample_paper_ids.txt:

# Sample Machine Learning Papers for Testing
649def34f8be52c8b66281af98ae884c09aef38f9
204e3073870fae3d05bcbc2f6a8e263d9b72e776
2b8a9c9c9d8f7e6d5c4b3a29f8e7d6c5b4a39f8e
# ... 7 more papers

sample_paper_ids.csv:

paper_id,title,source
649def34f8be52c8b66281af98ae884c09aef38f9,"Attention Is All You Need","transformer_survey"
204e3073870fae3d05bcbc2f6a8e263d9b72e776,"BERT: Pre-training","nlp_collection"
# ... 8 more papers with titles and sources

💡 Common Use Cases

Academic Research Workflows

  • Export from reference manager (Zotero, Mendeley, EndNote)
  • Create paper ID lists from systematic reviews
  • Import citation networks for meta-analysis
  • Track research evolution over time
  • Import dataset paper collections for reproducibility
  • Upload conference proceedings for field analysis
  • Process collaboration networks between research groups
  • Analyze venue-specific research trends
  • Import large paper collections for statistical analysis
  • Process citation databases for network metrics
  • Upload institutional publications for impact assessment
  • Analyze temporal research patterns

File Preparation Strategies

  • Remove duplicate IDs before upload
  • Validate paper ID format (32-40 character strings)
  • Check for invalid characters (only alphanumeric allowed)
  • Test with small samples first
  • Split large files (>1000 papers) for better performance
  • Use descriptive filenames for organization
  • Add comments in .txt files explaining data source
  • Keep original and processed versions
  • Ensure UTF-8 encoding for international characters
  • Validate paper IDs exist in Semantic Scholar
  • Check network connection stability
  • Have adequate disk space for progress files

🔧 Technical Specifications

File Limits & Requirements

| Specification   | Limit             | Recommendation                     |
|-----------------|-------------------|------------------------------------|
| File Size       | 200MB max         | <10MB for best performance         |
| Paper Count     | 10,000 per import | Start with 100-500                 |
| Paper ID Length | 32-40 characters  | Standard Semantic Scholar format   |
| Encoding        | UTF-8             | Ensure international compatibility |
| Line Endings    | Any format        | Unix, Windows, Mac supported       |

Validation Rules

  • Paper ID Format: Alphanumeric strings, 32-40 characters
  • No Duplicates: Within the same file (duplicates across imports are handled)
  • Valid Characters: Letters (a-z, A-Z) and numbers (0-9) only
  • Empty Handling: Empty lines and comment lines (#) are ignored
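These rules can be checked locally before upload. A minimal sketch of a validator implementing the four rules above (the function name is illustrative; the platform's own validation may differ):

```python
import re

# Alphanumeric strings, 32-40 characters
VALID_ID = re.compile(r"^[A-Za-z0-9]{32,40}$")

def validate_ids(lines):
    """Apply the validation rules to a list of file lines.

    Returns (valid_ids, errors): duplicates are dropped silently,
    invalid IDs are reported with their line number.
    """
    seen, valid, errors = set(), [], []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                       # empty lines and comments ignored
        if not VALID_ID.match(line):
            errors.append((lineno, line))  # clear error with line number
        elif line not in seen:
            seen.add(line)                 # no duplicates within the file
            valid.append(line)
    return valid, errors
```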

Error Handling

  • Invalid Format: Clear error messages with line numbers
  • File Read Errors: UTF-8 encoding problems automatically detected
  • Empty Files: Validation prevents empty or invalid imports
  • Malformed CSV: Pandas parsing with automatic error recovery

🚨 Troubleshooting

Upload Issues

"No valid paper IDs found"

  ✅ Check file format (one ID per line for .txt)
  ✅ Verify paper ID length (32-40 characters)
  ✅ Ensure alphanumeric characters only
  ✅ Remove any extra whitespace or special characters

"Error reading file"

  ✅ Save the file with UTF-8 encoding
  ✅ Check for corrupted or binary content
  ✅ Try opening the file in a text editor first
  ✅ Re-save from the original application

"File too large"

  ✅ Split into multiple smaller files
  ✅ Remove unnecessary metadata columns
  ✅ Compress using standard text compression
  ✅ Upload in smaller batches

"Paper not found" errors

  ⚠️ Some paper IDs may not exist in Semantic Scholar
  ✅ Verify IDs are from Semantic Scholar, not other databases
  ✅ Check for typos in paper ID strings
  ✅ Test with known valid IDs first

"API rate limiting" errors

  ✅ Increase the API delay to 2-3 seconds
  ✅ Reduce the batch size to 10-25 papers
  ✅ Add a Semantic Scholar API key to .env
  ✅ Wait a few minutes and retry
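If you script against the API yourself, retrying with exponential backoff handles transient rate limits. A minimal sketch of the pattern; the exception type and delay values are placeholders to adapt to your client:

```python
import time

def with_backoff(fetch, max_retries=5, base_delay=1.5):
    """Call a request-making function, backing off exponentially on failure.

    `fetch` is any zero-argument callable that raises on a rate-limit
    response (RuntimeError here stands in for your client's exception).
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except RuntimeError:
            # Wait longer after each failure: base, 2x base, 4x base, ...
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("rate limit persisted after %d retries" % max_retries)
```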

"Database connection" errors

  ✅ Verify the Neo4j database is running
  ✅ Check the .env file configuration
  ✅ Test the database connection separately
  ✅ Ensure network connectivity

Slow upload processing

  ✅ Check the file size and reduce if needed
  ✅ Verify a stable internet connection
  ✅ Close other applications using memory
  ✅ Consider uploading during off-peak hours

Memory errors during import

  ✅ Reduce the batch size to 10-20 papers
  ✅ Close other applications
  ✅ Restart the Streamlit application
  ✅ Consider upgrading system memory

📈 Performance Tips

Optimal Configuration

For most use cases, these settings provide the best balance of speed and reliability:

# Recommended settings for file uploads
max_papers = 500           # Start small, increase gradually
batch_size = 25           # Good balance of speed and memory usage
api_delay = 1.5           # Avoid rate limiting while maintaining speed
include_citations = True  # Essential for network analysis
include_authors = True    # Valuable for collaboration analysis
min_citation_count = 5    # Focus on impactful papers

Large File Strategies

  1. Test first: Upload 10-50 papers to validate process
  2. Split files: Break >1000 papers into 500-paper chunks
  3. Schedule uploads: Run large imports during off-hours
  4. Monitor resources: Watch memory and CPU usage
  5. Backup progress: Save progress files for resumable imports
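The file-splitting step is easy to script. A sketch that breaks a large .txt ID file into chunk files ready for sequential upload (the function name and `.partN.txt` naming are illustrative):

```python
def split_id_file(path, chunk_size=500):
    """Split a large paper-ID file into chunks of at most chunk_size IDs.

    Comments and blank lines are dropped; returns the chunk file paths.
    """
    ids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                ids.append(line)
    chunks = []
    for i in range(0, len(ids), chunk_size):
        out = "%s.part%d.txt" % (path, i // chunk_size + 1)
        with open(out, "w", encoding="utf-8") as f:
            f.write("\n".join(ids[i:i + chunk_size]) + "\n")
        chunks.append(out)
    return chunks
```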

Quality Assurance

  1. Validate sources: Ensure paper IDs come from reliable sources
  2. Check samples: Preview uploaded data before full import
  3. Monitor error rates: Watch for patterns in failed imports
  4. Document provenance: Keep notes about data sources and dates

🎉 Success Stories

File upload has enabled researchers to:

  • Import 2,000+ papers from systematic literature reviews in minutes
  • Process conference proceedings with complete citation networks
  • Upload bibliographic exports from institutional repositories
  • Batch import research datasets for reproducibility studies
  • Create curated paper collections for specific research domains

🔗 Integration with Other Features

After successful file upload:

  1. Train ML models with your imported data
  2. Analyze networks using community detection
  3. Explore interactively with visualization tools
  4. Generate reports for publications

Next Steps:

  • Data Import Pipeline - Advanced import features and configuration
  • Interactive Features - Using file upload in the web interface
  • Quick Start - Complete workflow after uploading data

Getting Started:

  • Demo Mode - Try file upload with sample data first (recommended!)
  • Installation - Platform setup requirements
  • Configuration - Database and environment setup

Ready to import? Start with the sample files to test the process, then upload your own research collections!

Need help? Check the troubleshooting section or visit the comprehensive Data Import guide for advanced options.