
Supported Document Types

ChronoScope intelligently processes multiple document formats to extract timeline events.


Overview

ChronoScope uses AI-powered document classification to automatically detect document types and optimize extraction strategies. Understanding which document types work best helps you prepare files for optimal results.


Supported File Formats

Fully Supported Formats

| Format | Extension | Support Level | Best For |
| --- | --- | --- | --- |
| PDF | .pdf | ✅ Excellent | Resumes, CVs, reports |
| Text | .txt | ✅ Excellent | Plain text timelines |
| Word | .docx | ✅ Good | Microsoft Word documents |
| Markdown | .md | ✅ Good | Structured text files |

Limited Support

| Format | Extension | Support Level | Notes |
| --- | --- | --- | --- |
| RTF | .rtf | ⚠️ Limited | May need conversion |
| ODT | .odt | ⚠️ Limited | OpenDocument format |
| HTML | .html | ⚠️ Limited | Plain HTML only |

Not Supported

| Format | Extension | Why Not Supported | Workaround |
| --- | --- | --- | --- |
| Images | .jpg, .png, .gif | No text extraction | Use OCR tool first |
| Scanned PDFs | .pdf (image-based) | No selectable text | Run through OCR |
| PowerPoint | .pptx | Complex layout | Export as PDF |
| Excel | .xlsx | Structured data format | Export as CSV, then import |

PDF Processing (Dual-Library Approach)

How It Works

ChronoScope uses two PDF extraction libraries with automatic fallback:

  1. PyMuPDF (primary) - Fast, reliable, works for most PDFs
  2. pdfplumber (fallback) - Better for complex layouts and tables

Automatic selection:

User uploads PDF
  → Try PyMuPDF extraction
      Success? → Return extracted text
      Failed?  → Try pdfplumber
          Success? → Return extracted text
          Failed?  → Show error (likely image-based PDF)
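
The fallback order above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not ChronoScope's actual code: the function names (`extract_with_pymupdf`, `extract_with_pdfplumber`, `extract_pdf_text`) and the choice to treat empty output as a failure are hypothetical. Only the `fitz.open` / `page.get_text()` and `pdfplumber.open` / `page.extract_text()` calls come from the libraries themselves.

```python
from typing import Callable, Sequence


def extract_with_pymupdf(path: str) -> str:
    """Primary extractor (hypothetical helper). PyMuPDF is imported lazily."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdfplumber(path: str) -> str:
    """Fallback extractor (hypothetical helper). pdfplumber is imported lazily."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages without text.
        return "\n".join((page.extract_text() or "") for page in pdf.pages)


def extract_pdf_text(
    path: str,
    extractors: Sequence[Callable[[str], str]] = (
        extract_with_pymupdf,
        extract_with_pdfplumber,
    ),
) -> str:
    """Try each extractor in order and return the first non-empty result."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # Assumption: any error (including a missing library) means
            # "try the next extractor".
            continue
        if text and text.strip():
            return text
    raise ValueError(
        "No selectable text found; the PDF is likely image-based and needs OCR."
    )
```

Passing the extractors as a parameter keeps the fallback order explicit and makes the function easy to exercise with stand-in extractors.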

PDF Requirements

For best results, PDFs should have:

✅ Selectable text - You can highlight and copy text
✅ Structured layout - Clear sections and paragraphs
✅ Readable fonts - Standard fonts, not decorative
✅ No encryption - Unprotected files

Testing if your PDF will work:

  1. Open PDF in viewer (Preview, Adobe Reader, etc.)
  2. Try selecting text with cursor
  3. If you can copy-paste text → ✅ Will work!
  4. If text is not selectable → ❌ Need OCR first
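
The manual check above can also be approximated in code. The sketch below is an assumption-laden heuristic, not part of ChronoScope: `looks_image_based`, `pdf_page_texts`, and the 20-characters-per-page threshold are all hypothetical and would need tuning on real files.

```python
from typing import Iterable, List


def looks_image_based(page_texts: Iterable[str], min_chars_per_page: int = 20) -> bool:
    """Return True when the pages carry too little text to be a text-based PDF.

    The threshold is an arbitrary illustrative value.
    """
    pages = list(page_texts)
    if not pages:
        return True
    total_chars = sum(len(text.strip()) for text in pages)
    return total_chars / len(pages) < min_chars_per_page


def pdf_page_texts(path: str) -> List[str]:
    """Collect per-page text with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]
```

A call such as `looks_image_based(pdf_page_texts("resume.pdf"))` returning `True` would suggest running OCR before uploading.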

Image-Based PDFs

If your PDF is a scanned image:

Option 1: Use OCR software

# Example with ocrmypdf (macOS/Linux, installed via Homebrew)
brew install ocrmypdf
ocrmypdf input.pdf output.pdf

# Example with Adobe Acrobat (Windows/Mac), a menu path rather than a command:
# File → Export To → Text (OCR if needed)

Option 2: Online OCR services


Document Type Detection

ChronoScope automatically classifies documents into types for optimized extraction.

Detected Types

1. Resume/CV

Characteristics:
- Section headers: "Experience", "Education", "Skills"
- Date ranges (Month Year - Month Year)
- Job titles and company names
- Bullet points with achievements

Extraction strategy:
- Focus on work experience and education sections
- Extract positions, companies, dates, locations
- Parse education degrees and institutions
- High confidence for explicit dates

Example structure:

PROFESSIONAL EXPERIENCE

Senior Software Engineer
TechCorp Inc., San Francisco, CA
June 2020 - Present
• Led development team of 5 engineers
• Built ML pipeline processing 1M+ events/day


2. Cover Letter

Characteristics:
- Narrative format (paragraphs)
- Mentions of specific experiences
- Date references within text
- Personal storytelling

Extraction strategy:
- Parse narrative for timeline references
- Extract mentioned companies/roles
- Infer dates from contextual clues
- Moderate confidence (dates often implicit)

Example structure:

During my three years at Google (2018-2021), I developed
a passion for machine learning. This led me to pursue
graduate studies at Stanford in 2021...


3. Personal Statement

Characteristics:
- Academic focus
- Research mentions
- Publication references
- Conference presentations

Extraction strategy:
- Extract research projects and publications
- Parse academic institutions and degrees
- Identify grants, awards, conferences
- High confidence for publications (explicit dates)

Example structure:

My research on neural networks, published in NeurIPS 2022,
built upon work I conducted during my PhD at MIT (2018-2022).
I presented preliminary findings at ICML 2021...


4. General Document

Characteristics:
- Unstructured text
- Mixed content types
- No clear timeline format
- Informal writing

Extraction strategy:
- Broad pattern matching for dates
- Extract any temporal references
- Lower confidence scores
- Manual review recommended


Extraction Quality by Document Type

Resume/CV: ⭐⭐⭐⭐⭐ Excellent

Success rate: 90-95%

Why it works well:
- Explicit date formats
- Structured sections
- Clear temporal order
- Standardized formatting

Typical extraction:
- 10-20 events per page
- 85-95% confidence scores
- Complete date ranges
- Locations and people

Optimization tips:
- Use standard section headers
- Include month/year for dates
- Separate entries with spacing
- Add locations for each role


Cover Letter: ⭐⭐⭐⭐ Very Good

Success rate: 70-85%

Why it works:
- Narrative timeline structure
- Specific experience mentions
- Clear transitions between roles

Challenges:
- Dates often implicit ("three years ago")
- Locations may be missing
- Less structured format

Typical extraction:
- 5-10 events per document
- 70-85% confidence scores
- Some date ranges missing
- Fewer location details

Optimization tips:
- Mention explicit dates when possible
- Reference specific companies/institutions
- Use temporal markers ("In 2020...")
- Include locations in narrative


Personal Statement: ⭐⭐⭐⭐ Very Good

Success rate: 75-90%

Why it works:
- Academic rigor with dates
- Structured achievements
- Publication references

Challenges:
- Mixed chronological order
- Research spans overlap
- Complex temporal relationships

Typical extraction:
- 8-15 events per document
- 75-90% confidence scores
- Good date coverage
- Strong institutional references

Optimization tips:
- Include dates for all milestones
- Reference publications with years
- Mention conference dates
- List institutions and locations


General Document: ⭐⭐⭐ Good

Success rate: 50-70%

Why it's challenging:
- Unstructured content
- Implicit timelines
- Mixed topics
- Varied date formats

Typical extraction:
- 2-10 events per document
- 50-70% confidence scores
- Many date ranges missing
- Locations often absent

Optimization tips:
- Add explicit dates where possible
- Structure into sections
- Use clear headers
- Separate events with spacing


Document Preparation Best Practices

Date Formatting

Optimal formats:

✅ Excellent:
- January 2020 - March 2023
- Jan 2020 - Mar 2023
- 01/2020 - 03/2023
- 2020-01 to 2023-03

✅ Good:
- 2020 - 2023 (year only)
- Q1 2020 - Q4 2023
- Spring 2020 - Summer 2023

⚠️ Acceptable but less optimal:
- "Three years" (requires reference point)
- "Recently" (too vague)
- "2020s" (decade reference)

❌ Avoid:
- "a while ago"
- "during college"
- "around that time"
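
To make the "Excellent" formats concrete, here is a minimal sketch of how such date ranges could be parsed with regular expressions. It is illustrative only and does not describe ChronoScope's extractor: `parse_year_month` and `parse_date_range` are hypothetical names, month names are matched leniently on their first three letters, and "Present" is treated as an open-ended range.

```python
import re
from typing import Optional, Tuple

YearMonth = Tuple[int, int]

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def parse_year_month(token: str) -> Optional[YearMonth]:
    """Parse 'January 2020', 'Jan 2020', '01/2020' or '2020-01' into (year, month)."""
    token = token.strip()
    m = re.fullmatch(r"([A-Za-z]{3,9})\.?\s+(\d{4})", token)
    if m and m.group(1)[:3].lower() in MONTHS:
        return int(m.group(2)), MONTHS[m.group(1)[:3].lower()]
    m = re.fullmatch(r"(\d{1,2})/(\d{4})", token)
    if m and 1 <= int(m.group(1)) <= 12:
        return int(m.group(2)), int(m.group(1))
    m = re.fullmatch(r"(\d{4})-(\d{2})", token)
    if m and 1 <= int(m.group(2)) <= 12:
        return int(m.group(1)), int(m.group(2))
    return None


def parse_date_range(text: str) -> Optional[Tuple[YearMonth, Optional[YearMonth]]]:
    """Split on a spaced '-', '–' or 'to' and parse both sides.

    Returns (start, end); end is None for 'Present'. Returns None when
    either side is not in one of the supported formats.
    """
    parts = re.split(r"\s+(?:-|–|to)\s+", text.strip(), maxsplit=1)
    if len(parts) != 2:
        return None
    start = parse_year_month(parts[0])
    if start is None:
        return None
    if parts[1].strip().lower() == "present":
        return start, None
    end = parse_year_month(parts[1])
    return (start, end) if end else None
```

Under this sketch all four "Excellent" examples parse to the same pair of (year, month) tuples, while vague phrases such as "a while ago" return `None`, which is one way to see why explicit formats tend to score higher.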

Section Headers

Recommended headers for resumes:

Work Experience / Professional Experience / Employment History
Education / Academic Background
Projects / Key Projects
Publications / Research
Skills / Technical Skills
Certifications / Licenses
Awards / Honors / Achievements

Headers ChronoScope recognizes:
- Experience, Work, Employment, Career
- Education, Academic, School, University
- Projects, Portfolio, Work Samples
- Publications, Research, Papers
- Awards, Honors, Achievements, Recognition
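
As an illustration of how the recognized keywords could map a header to a section, consider the sketch below. The keyword lists come from the list above; the canonical section names, the `normalize_header` function, and the longest-keyword-first rule are assumptions made for this example and may differ from what ChronoScope does internally.

```python
import re
from typing import Optional

# Keywords taken from the recognized-headers list; section names are hypothetical.
SECTION_KEYWORDS = {
    "experience": ("experience", "work", "employment", "career"),
    "education": ("education", "academic", "school", "university"),
    "projects": ("projects", "portfolio", "work samples"),
    "publications": ("publications", "research", "papers"),
    "awards": ("awards", "honors", "achievements", "recognition"),
}

# Longest keywords first so "work samples" wins over the shorter "work".
_KEYWORD_TO_SECTION = sorted(
    ((kw, section) for section, kws in SECTION_KEYWORDS.items() for kw in kws),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def normalize_header(header: str) -> Optional[str]:
    """Return the section whose keyword appears as a whole word in the header."""
    lowered = header.lower()
    for keyword, section in _KEYWORD_TO_SECTION:
        if re.search(rf"\b{re.escape(keyword)}\b", lowered):
            return section
    return None
```

With this sketch, "Professional Experience" and "Employment History" both resolve to the same section, while a header with no listed keyword (for example "Technical Skills") returns `None`.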

Content Structure

Example: Well-structured resume entry

Senior Software Engineer                    ← Job title (clear)
Google LLC, Mountain View, CA              ← Company, Location
June 2020 - Present                        ← Dates (explicit)

• Led team of 5 engineers                  ← Bullet points
• Built ML pipeline handling 1M+ events    ← Quantified achievements
• Deployed to production in 6 months       ← Specific timeline

Example: Poorly structured entry

Worked at Google for a while doing engineering stuff.
Was part of a team. Did some machine learning work.

File Size Limitations

| File Type | Max Size | Optimal Size | Notes |
| --- | --- | --- | --- |
| PDF | 10 MB | < 2 MB | Large files may time out |
| TXT | 5 MB | < 500 KB | Plain text loads fastest |
| DOCX | 10 MB | < 3 MB | Complex formatting adds size |
| MD | 5 MB | < 1 MB | Markdown is lightweight |
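
The limits in the table can be checked before uploading. The following is a small sketch using the table's values; the `check_size` function, its return labels, and the choice of 1 MB = 1024 × 1024 bytes are assumptions for illustration.

```python
import os

KB = 1024
MB = 1024 * KB  # assumption: binary megabytes

# (max size, optimal size) per extension, taken from the table above.
SIZE_LIMITS = {
    ".pdf": (10 * MB, 2 * MB),
    ".txt": (5 * MB, 500 * KB),
    ".docx": (10 * MB, 3 * MB),
    ".md": (5 * MB, 1 * MB),
}


def check_size(filename: str, size_bytes: int) -> str:
    """Classify a file as 'optimal', 'ok_but_large', 'too_large' or 'unsupported'."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SIZE_LIMITS:
        return "unsupported"
    max_size, optimal_size = SIZE_LIMITS[extension]
    if size_bytes > max_size:
        return "too_large"
    if size_bytes >= optimal_size:
        return "ok_but_large"
    return "optimal"
```

For a real file, `check_size(path, os.path.getsize(path))` gives the same classification.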

Handling Large Files

If file > 10 MB:

  1. Compress PDF:

    # Using Ghostscript
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
       -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH \
       -sOutputFile=output.pdf input.pdf
    

  2. Split into sections:
     - Upload work experience separately
     - Upload education separately
     - Merge in ChronoScope

  3. Remove images:
     - Use text-only export
     - Remove embedded images
     - Keep only textual content

Multi-Language Support

Supported Languages

| Language | Support Level | Notes |
| --- | --- | --- |
| English | ✅ Full support | Primary language, best results |
| Spanish | ⚠️ Partial support | Date formats may vary |
| French | ⚠️ Partial support | Accented characters supported |
| German | ⚠️ Partial support | Works with standard dates |
| Chinese | ❌ Limited | Date formats differ significantly |
| Japanese | ❌ Limited | Complex date representations |

Best Practices for Non-English

For best results:

  1. Use ISO date format: YYYY-MM-DD
  2. Translate key sections: Experience, Education
  3. Keep location names in English: "Paris, France" vs "Paris, Francia"
  4. Use English month names when possible: "Jan 2020" vs "Janv 2020"

Example (mixed language):

Ingénieur Logiciel Senior          ← French title (OK)
Google LLC, Paris, France           ← English location (Good)
January 2020 - Present              ← English dates (Best)
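
If a document already uses localized month names, they can be converted before upload. The sketch below handles French month names only and outputs ISO-style `YYYY-MM` strings. It is a hypothetical pre-processing helper, not a ChronoScope feature: the function name and the prefix table are assumptions, and unaccented spellings are not covered.

```python
import re
from typing import Optional

# French month prefixes (covering common abbreviations and full names).
FRENCH_MONTH_PREFIXES = {
    "janv": 1, "févr": 2, "mars": 3, "avr": 4, "mai": 5, "juin": 6,
    "juil": 7, "août": 8, "sept": 9, "oct": 10, "nov": 11, "déc": 12,
}


def french_month_year_to_iso(text: str) -> Optional[str]:
    """Convert e.g. 'Janv 2020' or 'février 2021' to '2020-01' / '2021-02'."""
    match = re.fullmatch(r"\s*([^\W\d_]+)\.?\s+(\d{4})\s*", text)
    if not match:
        return None
    word = match.group(1).lower()
    for prefix, month in FRENCH_MONTH_PREFIXES.items():
        if word.startswith(prefix):
            return f"{match.group(2)}-{month:02d}"
    return None
```

Other languages would need their own prefix tables, which is one reason the English or ISO forms recommended above are the simpler route.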

Troubleshooting by Document Type

PDFs Not Extracting

PDF upload succeeds but no text extracted

Likely cause: Image-based PDF (scanned document)

Diagnosis:

  1. Open PDF in viewer
  2. Try selecting text
  3. If can't select → Image-based

Fix:

  1. Use OCR software (see PDF Processing)
  2. Re-upload processed file


Low Confidence Scores

Events extracted but confidence < 70%

Likely causes:
- Implicit dates ("a few years ago")
- Unstructured narrative
- Missing key information

Fix:

  1. Add explicit dates
  2. Use structured format
  3. Include locations and company names
  4. Separate events clearly


Missing Events

Some obvious events not extracted

Likely causes:
- Unusual date format
- Non-standard section headers
- Merged entries (multiple jobs in one paragraph)

Fix:

  1. Use standard date formats (Month Year - Month Year)
  2. Use recognized section headers (Experience, Education)
  3. Separate entries with clear spacing
  4. One event per paragraph/bullet


Duplicate Events

Same event extracted multiple times

Likely cause: Event mentioned in multiple documents or sections

Fix:

  1. Use duplicate detection: 🔍 Validation tab
  2. Review similarity score
  3. Merge or delete duplicates
  4. Improve source document clarity
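
To give a feel for what a similarity score between two events might look like, here is a sketch built on Python's standard `difflib`. How ChronoScope actually scores duplicates is not documented here; the event fields (`title`, `start`), the rule that different start dates score 0, and the 0.85 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def event_similarity(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Hypothetical score in [0, 1]: title similarity, 0 when start dates differ."""
    if a.get("start") != b.get("start"):
        return 0.0
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()


def find_duplicates(
    events: List[Dict[str, str]], threshold: float = 0.85
) -> List[Tuple[int, int]]:
    """Return index pairs whose similarity meets the (assumed) threshold."""
    pairs = []
    for i in range(len(events)):
        for j in range(i + 1, len(events)):
            if event_similarity(events[i], events[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Pairs returned by `find_duplicates` are candidates for manual review, not automatic deletion, which matches the merge-or-delete step above.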


Document Type Comparison

| Aspect | Resume | Cover Letter | Personal Statement | General |
| --- | --- | --- | --- | --- |
| Structure | High | Medium | Medium | Low |
| Date Precision | High | Medium | Medium | Low |
| Location Info | High | Medium | High | Low |
| Event Count | High | Medium | Medium | Low |
| Confidence | 85-95% | 70-85% | 75-90% | 50-70% |
| Best Use | Career timeline | Narrative context | Academic history | Supplemental |

Recommendations by Use Case

Career Portfolio

Best document types:
- Professional resume (primary)
- LinkedIn export (supplementary)
- Cover letters (for narrative context)

Avoid:
- General documents
- Informal notes


Academic Timeline

Best document types:
- Academic CV (primary)
- Personal statements (secondary)
- Publication lists (supplementary)

Avoid:
- Non-academic documents
- Mixed-purpose files


Life Story

Best document types:
- Chronological life summary (create for this purpose)
- Personal statements
- Diary exports (if structured)

Avoid:
- Unstructured journal entries
- Stream-of-consciousness writing


Next Steps

After understanding document types:


Quick Reference

| Task | Recommendation |
| --- | --- |
| Upload resume | PDF or DOCX format, standard sections |
| Check if PDF works | Try selecting text in viewer |
| Convert scanned PDF | Use OCR software |
| Optimize dates | Use "Month Year - Month Year" format |
| Improve confidence | Add explicit dates and locations |
| Handle large files | Compress or split into sections |
| Non-English docs | Use ISO dates, English locations |

Pro Tip

Create a "Timeline-Optimized" version of your resume specifically for ChronoScope: explicit dates, clear sections, separated entries, complete locations. Keep your original resume as-is for job applications, but use the optimized version for timeline extraction.

